Figure 1: "Keywords"
What is Keyword Extraction?
Keyword extraction is the process of finding the terms in a text that best describe its overall meaning. This process is incredibly useful when summarizing text. Search engines also make use of keyword extraction to find the key terms on webpages. The search engine then finds the webpages whose key terms best match the user's search.
Term Frequency
The simplest approach to finding keywords would be to find the words that are most common in the text. This usually returns words such as ‘and’ or ‘the’ since they are widespread in English. These words are known as ‘stop words’ and should be filtered out as they do not provide any information about the text.
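As a rough illustration of the idea, here is a minimal sketch in Python that counts words and filters out stop words. The stop-word list, the `top_terms` function name and the sample sentence are made up for this example; a real stop-word list would be much longer.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def top_terms(text, n=3):
    """Return the n most frequent words in text that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

text = "The cat sat on the mat and the cat slept. The dog chased the cat."
print(top_terms(text, 1))  # [('cat', 3)]
```

Without the stop-word filter, 'the' would come out on top, since it appears five times in the sample sentence.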
This method can work well on single documents but doesn’t work well when trying to find keywords in a collection of related documents (a corpus). This is because the most common words will be similar throughout all the documents (for example, in a book series the most common words in all of the books could be the names of the main characters). This doesn’t give us that much information about the meaning of the documents. Instead, we need to use a method that finds words that are common in one document and that aren’t common in the others since this will tell us what the document is about and how it differs from the others.
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a formula for calculating how important a word is to one document within a corpus.
Figure 2: the TF-IDF formula
The formula works by multiplying the term frequency (how often the word appears in the document) by the logarithm of the total number of documents divided by the number of documents that contain the word. The more documents contain the word (the less unique the word is to any one document), the closer this ratio is to one. The logarithm of one is 0, so the less unique a word is to a document, the lower its TF-IDF value.
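Written out from the description above, the formula looks like this. The symbols are labels chosen for this sketch: t is the word, d is the document, N is the number of documents in the corpus and df(t) is the number of documents that contain t. The base of the logarithm is not stated, so the natural logarithm is assumed in the worked numbers.

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\left(\frac{N}{\text{df}(t)}\right)
$$

For example, in a corpus of N = 4 documents, a word that appears in all four gets log(4/4) = log(1) = 0 regardless of how often it occurs, while a word that appears in only one document gets log(4/1) ≈ 1.386 multiplied by its term frequency.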
This means that the higher the TF-IDF value is, the more unique a word is to a document: it appears often in that document and rarely in the others. Therefore, the words with the highest TF-IDF values will give the best representation of what the document is about.
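A minimal Python sketch of this calculation is shown below. It assumes raw word counts for the term frequency and the natural logarithm with no smoothing, which is one of several common variants; the `tf_idf` function name and the three tiny 'documents' are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    docs is a list of token lists. The term frequency is the raw count of
    the word in the document, and the inverse document frequency is
    log(N / df) without smoothing (an assumption of this sketch).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

docs = [
    "alice took the wand and the wand sparked".split(),
    "alice lost the map".split(),
    "alice rode the dragon".split(),
]
scores = tf_idf(docs)
best = max(scores[0], key=scores[0].get)
print(best)  # wand
```

Here 'alice' and 'the' appear in every document, so both score 0, while 'wand' appears twice in the first document and nowhere else, so it scores highest there.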
RAKE
RAKE (Rapid Automatic Keyword Extraction) is another algorithm that can be used for keyword extraction. Instead of finding single words, RAKE finds key phrases. The algorithm works by splitting the text at stop words and punctuation, so that each remaining run of consecutive words becomes a candidate key phrase. Then, a co-occurrence matrix is formed which shows how often each pair of words appears together within the candidate phrases. A score is then calculated for each word by dividing its degree (the number of co-occurrences, counting the word itself) by its frequency (the number of times the word appears). The score for each candidate key phrase is the sum of the scores of its words, and the candidates with the highest scores are chosen as the key phrases.
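The steps can be sketched in Python as follows. This is a simplified illustration rather than a reference implementation: it follows the usual formulation in which stop words and punctuation act as phrase boundaries, it uses a tiny made-up stop-word list, and the function names `candidate_phrases` and `rake` are chosen for this example.

```python
import re
from collections import defaultdict

# A tiny, illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}

def candidate_phrases(text):
    """Split text into runs of non-stop-words, broken at stop words and punctuation."""
    phrases = []
    # Punctuation always ends a phrase, so split on it first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # number of times each word appears
    degree = defaultdict(int)  # co-occurrences within phrases, counting the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    # The score of a phrase is the sum of the scores of its words.
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

text = ("Keyword extraction is the process of finding key phrases in a document. "
        "Keyword extraction is useful for search engines.")
ranked = rake(text)
print(ranked[0])  # ('finding key phrases', 9.0)
```

Summing the degrees row by row gives the same numbers as building the full co-occurrence matrix, so the sketch never stores the matrix itself. Because each word's score grows with the length of the phrases it appears in, longer candidate phrases tend to rank higher.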
The logic behind RAKE is rather simple, but the algorithm works very effectively, which makes it a popular choice for keyword extraction.
Personal Opinion
I feel that many of the current keyword extraction algorithms, although impressive, are not as useful for summarising text as they may seem. This is because these algorithms fail to consider synonyms. Synonyms are important because a document may use several similar words to convey its main message instead of repeating the same one. Algorithms could make use of word vectors, which represent the meanings of words, and then look for groups of similar vectors that are used repeatedly.