TF-IDF Keyword Analysis

How do we decide whether a piece of writing is good or not? One measure is its relevance to the subject at hand. A high-quality document is highly relevant to its topic, providing more information about that topic in fewer words. Search engines use similar techniques to determine what is most relevant to a user. They calculate the relevance of a search keyword to all the available documents and rank documents with high relevance above documents with low relevance in the search results.

As content creators, we have to increase the relevance of our content not just for humans but for search engines as well. Search engines rely on the structure of the content. To help search engines understand our content, we have to incorporate clues as to what it means. A good starting point is to identify the keywords of our topic. When creating online content, these keywords are then strategically incorporated into elements of the page such as the title, headings and body to indicate the context of the page. However, keyword density alone is not sufficient for ensuring a high page rank; TF-IDF is also important.

What does TF-IDF stand for?

The term TF-IDF stands for term frequency – inverse document frequency. Simply put, TF-IDF calculates how important a keyword (i.e. term) is for a document in the context of a collection of documents (also known as a corpus). As an example, consider the phrase "the brown fox". Any English document will have a high keyword density for "the" because it occurs quite frequently within the document. However, the TF-IDF score for "the" would be low because it occurs quite frequently across all other English-language documents too. "brown" is rarer than "the", hence documents that have a high keyword density for "brown" would typically have a high TF-IDF score for it.

TF-IDF keyword analysis is used mainly in document retrieval and text mining applications. Search engines, like Google, use TF-IDF to find documents that contain the provided search query. A document that has a high TF-IDF score for the search term is considered more relevant to that term than a document that has a low TF-IDF score.

How is TF-IDF computed?

TF-IDF for a term is calculated in two parts: term frequency (TF) is the normalised frequency of occurrence of a word in a document, and inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents in which the term appears. The TF-IDF score is the product of the two.
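The two parts can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: documents are already split into lowercase words, the natural logarithm is used, and a term that appears nowhere in the corpus is given an IDF of zero. The function names and the three-document toy corpus are made up for the example.

```python
import math

def term_frequency(term, tokens):
    """Occurrences of the term divided by the document's total word count."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(total number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus scores zero
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    """TF-IDF is the product of the two parts."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus: each document is a list of lowercase words.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the cat sat on the mat".split(),
]

score_the = tf_idf("the", corpus[0], corpus)
score_brown = tf_idf("brown", corpus[0], corpus)
```

In this toy corpus "the" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is zero, while "brown" appears in only one document and gets a positive score.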

Term frequency (TF) in its simplest implementation can be considered as the number of times the word occurs in the document. However, longer documents tend to have higher raw counts for words, hence the TF score is normalised by dividing the count of occurrences by the total word count of the document. IDF can be seen as a measure of the amount of information a keyword provides across a document collection: the higher the IDF value for a keyword, the more information it provides and hence the more relevant it is to the documents that contain it.
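To see why normalisation matters, here is a small sketch comparing raw counts with normalised TF for a short and a long document. The documents and helper names are invented for the illustration.

```python
from collections import Counter

def raw_count(term, tokens):
    """Number of times the term occurs in the document."""
    return Counter(tokens)[term]

def normalised_tf(term, tokens):
    """Raw count divided by the total word count of the document."""
    if not tokens:
        return 0.0
    return raw_count(term, tokens) / len(tokens)

# Hypothetical documents: 4 words versus 20 words.
short_doc = "brown fox jumps high".split()
long_doc = ["fox", "fox"] + ["filler"] * 18

raw_short = raw_count("fox", short_doc)      # 1 occurrence
raw_long = raw_count("fox", long_doc)        # 2 occurrences
tf_short = normalised_tf("fox", short_doc)   # 1 / 4
tf_long = normalised_tf("fox", long_doc)     # 2 / 20
```

By raw count the long document looks more relevant to "fox", but after normalisation the short document scores higher because a larger share of its words is the keyword.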

A challenge in measuring TF-IDF is that TF can be measured by looking at the document in hand, whereas measuring IDF requires access to the entire document collection. In addition, you need efficient statistical algorithms that can calculate this value over a large set of documents in a short period of time. When generating content for the Internet, such as blogs, this becomes an enormous task due to the amount of content being published online every day.
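One common way to keep the IDF side manageable is to pass over the collection once, record how many documents contain each term, and then reuse that table for every lookup. The sketch below assumes the whole corpus fits in memory as lists of words, which will not hold at web scale; the names are hypothetical.

```python
import math
from collections import Counter

def document_frequencies(corpus):
    """Count, for every term, the number of documents it appears in."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # set() counts a term once per document
    return df

def idf_table(corpus):
    """Precompute IDF for every term seen in the corpus (natural log)."""
    total = len(corpus)
    return {term: math.log(total / count)
            for term, count in document_frequencies(corpus).items()}

# Hypothetical toy corpus standing in for a much larger collection.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the cat sat on the mat".split(),
]

df = document_frequencies(corpus)
idf = idf_table(corpus)
```

Once the table exists, scoring a new document only needs its own word counts plus dictionary lookups, so the expensive pass over the collection is not repeated for each keyword.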

Keyword.io helps you analyse your keywords for their TF-IDF score. It provides two kinds of information. First, it shows the pages that are considered most relevant for the keyword in Google search. Second, it provides a list of keywords related to your keyword, which can be included in your content to increase its relevance. It calculates the TF-IDF score based on a large corpus that is growing every day. Before you start generating content, you should analyse your keywords in Keyword.io.