Exploring TF-IDF: Insights into Information Retrieval
- Srivaths Gondi
- Jun 30, 2023
- 4 min read
Updated: Jul 11, 2023

In the realm of information retrieval, queries or keywords are routinely used to extract relevant information from documents, search engines, and chat applications. However, the algorithms that drive this process, such as TF-IDF, often remain shrouded in mystery.
In this discussion, we will explore the TF-IDF algorithm, its applications in Natural Language Processing (NLP), and its pivotal role in textual representation and classification.
Let’s say you write a query such as “How do I get my cat to do taxes?” in a hypothetical search engine that relies entirely on TF-IDF for information retrieval.
There could be three possible ways to fetch information:
1) Get results based on the entire sentence.
2) Look for results based on every keyword in the query and merge the results.
3) Search based on only a few unique keywords.
Filtering results based on the entire question or sentence would result in the exclusion of a significant amount of useful information since it is uncommon for all the keywords to appear in the exact same order.
Conversely, searching for results based on each individual keyword independently is impractical as it would yield an abundance of irrelevant and unnecessary information.
WHY?
Search engines may index billions of documents containing common keywords like “do”, “to”, and “get”; combining the results for these keywords would be completely excessive and pointless.
Thus, the optimal approach is to select a few crucial keywords, such as “how,” “cat,” and “taxes,” and conduct the search based on them. This method ensures a more relevant, refined, and efficient means of retrieving pertinent information.
But how exactly does the system know which keywords are essential?
This is precisely the use case TF-IDF addresses. It ranks every keyword in a sentence based on its rarity across the entire dataset, which provides a way to categorize keywords by their importance.
To understand the specifics of how it works, it's necessary to know the following terms:
Corpus: A corpus refers to a collection of written or spoken texts used for linguistic analysis and training language models.
Document: a single unit of information in the corpus, commonly one phrase or sentence that serves as an input.
Token: a subpart of a document. A single word, a phrase, or even a sentence can constitute a token.
For illustration let the corpus be:
D1- Grocery shops near me
D2- New supermarket Dmart opened
D3- Discounts on groceries and vegetables.
D4- Best nearby salons for me
D5- My Instagram profile
Here D1, D2, D3, … are the documents.
To compute the rankings for each keyword, we must compute TF (term frequency) and IDF (inverse document frequency).
Note: All documents must be lowercased and stemmed before applying the algorithm, since words like “my” and “me”, “groceries” and “grocery”, or “near” and “nearby” must be treated the same.
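To make the preprocessing concrete, here is a minimal sketch in Python using NLTK's PorterStemmer (assuming `nltk` is installed). Note that a plain stemmer conflates “grocery”/“groceries” but not pairs like “me”/“my” or “near”/“nearby”, which would need an extra normalization step such as lemmatization:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(document: str) -> list[str]:
    """Lowercase a document and stem every token."""
    return [stemmer.stem(word) for word in document.lower().split()]

corpus = [
    "Grocery shops near me",
    "New supermarket Dmart opened",
    "Discounts on groceries and vegetables",
    "Best nearby salons for me",
    "My Instagram profile",
]
documents = [preprocess(doc) for doc in corpus]
print(documents[0])  # ['groceri', 'shop', 'near', 'me']
```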
TF:
Term frequency looks at how often a particular term appears relative to the total number of terms in its document:

$$\mathrm{TF}(t, d) = \frac{\text{occurrences of } t \text{ in } d}{\text{total number of terms in } d}$$
For D1 (“grocery shops near me”), each of the four terms appears once:
TF of the term “grocery” = 1/4; TF of the term “near” = 1/4
TF of the term “shops” = 1/4; TF of the term “me” = 1/4
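As a minimal sketch (the tokens below are hand-stemmed to match the running example), TF can be computed like this:

```python
from collections import Counter

def term_frequency(term: str, document: list[str]) -> float:
    """TF = occurrences of the term / total number of terms in the document."""
    return Counter(document)[term] / len(document)

# D1 after lowercasing and stemming
d1 = ["groceri", "shop", "near", "me"]
print(term_frequency("groceri", d1))  # 0.25
```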
IDF:
Inverse document frequency looks at how common (or uncommon) a word is across the corpus. It is calculated as follows, where t is the term whose rarity we are measuring and N is the number of documents (d) in the corpus (D); the denominator is simply the number of documents in which the term t appears:

$$\mathrm{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$$
A lot of the time, it's possible that a term appears in none of the documents, making the denominator zero and producing a division error. To handle this, 1 is often added to the denominator.
Shown here as the raw ratio N/df, before the logarithm is applied:
IDF of the term “grocery” = 5/2; IDF of the term “shop” = 5/1
IDF of the term “near” = 5/2; IDF of the term “me” = 5/3
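A minimal IDF sketch, with the optional +1 smoothing described above (tokens are hand-normalized so that, for example, “nearby” and “my” match “near” and “me”, as the preprocessing note assumes):

```python
import math

def inverse_document_frequency(term: str, documents: list[list[str]],
                               smooth: bool = False) -> float:
    """IDF = log(N / df), where df is the number of documents containing the term.

    With smooth=True, 1 is added to df so the denominator can never be zero.
    """
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (df + 1 if smooth else df))

docs = [
    ["groceri", "shop", "near", "me"],
    ["new", "supermarket", "dmart", "open"],
    ["discount", "on", "groceri", "and", "veget"],
    ["best", "near", "salon", "for", "me"],  # "nearby" normalized to "near"
    ["me", "instagram", "profil"],           # "my" normalized to "me"
]
print(inverse_document_frequency("shop", docs))  # log(5/1) ≈ 1.609
```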
TF-IDF reading for the phrase:
Multiplying TF by the raw IDF ratio for each term of D1:
TF-IDF of the term “grocery” = 1/4 × 5/2 = 0.625; TF-IDF of the term “near” = 1/4 × 5/2 = 0.625
TF-IDF of the term “shops” = 1/4 × 5/1 = 1.25; TF-IDF of the term “me” = 1/4 × 5/3 ≈ 0.417
“Shops” scores the highest, since it appears in only one document.
Now it's easier to interpret keywords' importance based on their TF-IDF scores and retain only the necessary keywords. Similarly, TF-IDF scores can also be calculated for all the documents if the end goal is text classification.
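Combining the two helpers sketched above (and the docs list from the IDF sketch) gives the full score; ranking the terms of D1 this way surfaces “shop” as the most informative keyword:

```python
def tf_idf(term: str, document: list[str],
           documents: list[list[str]]) -> float:
    """TF-IDF of a term within one document of the corpus."""
    return (term_frequency(term, document)
            * inverse_document_frequency(term, documents))

d1 = docs[0]
scores = {t: tf_idf(t, d1, docs) for t in set(d1)}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
# shop 0.402, groceri 0.229, near 0.229, me 0.128
```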
The reason we use a logarithmic function to compute IDF is to prevent it from reaching excessively high values that could overshadow the term frequency value, thereby compromising the sensitivity of the calculation.
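A quick arithmetic check shows why: for a term appearing in 1 of 1,000,000 documents, the raw ratio would dwarf any TF value, while the logarithm keeps it on a usable scale:

```python
import math

n_docs, df = 1_000_000, 1
print(n_docs / df)            # 1000000.0 (raw ratio overwhelms any TF value)
print(math.log(n_docs / df))  # about 13.82 (log keeps the scale comparable)
```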
In just this way, a search engine pinpoints the necessary keywords in your query and performs the search.

You might also be curious: how are these filtered results ranked by the search engine?
Indeed, TF-IDF values are often represented as vectors for each document in the corpus. These vectors, containing TF-IDF values for different terms, can be plotted in a multi-dimensional space. The similarity between a query (represented as a vector) and the document vectors can be measured using angular distance or cosine similarity.
Documents with the least angular distance or the highest cosine similarity with the query vector are ranked higher, indicating their similarity or relevance to the query.
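Here is a minimal sketch of this ranking step (the vectors below are illustrative, not derived from a real vocabulary):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

query_vec = [0.9, 0.0, 0.5]  # hypothetical query TF-IDF vector
doc_vecs = {"D1": [0.63, 1.25, 0.63], "D4": [0.0, 0.0, 0.63]}
ranked = sorted(doc_vecs,
                key=lambda d: cosine_similarity(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)  # documents ordered by similarity to the query
```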
It's important to note that search engines consider various other factors when determining the ranking of content. These factors may differ between search engines and can include additional features like page authority, user behavior, recency, and other relevant signals. However, this should give you an insight into how search engines work.