TF-IDF for Dummies

For my latest paper I had to implement some TF-IDF analysis, and since I couldn't find a quick, understandable overview in one place on the internet, I'll give a brief explanation here.

So what is it?  TF is short for Term Frequency and IDF means Inverse Document Frequency.

To get the term frequency of a term, you count all occurrences of that term in one document and divide that by the total number of word occurrences in the document.  You can do this with a simple piece of Python code.  What I did additionally was remove all the stop words of the English language.

def makeSnippetDictionary( snippets ):
    snippetdict = { }
    wordlist = makeSnippetWordList( snippets )
    # load the English stop words (comma-separated in stopwords.txt)
    with open( "stopwords.txt" ) as f:
        stopwords = set( f.read().split( "," ) )
    # count every occurrence of each non-stop-word
    for word in wordlist:
        if word not in stopwords:
            snippetdict[word] = snippetdict.get( word, 0 ) + 1
    # divide each count by the total number of words:
    # that gives the relative term frequency
    for word, count in snippetdict.items():
        snippetdict[word] = count / len( wordlist )
    return snippetdict

So this is just one part of the code, but it shows pretty much what happens.  I split the document, a completely lower-cased string, into one list of words.  Then I built a dictionary (for other programming languages: a map or mapping) counting how often each word occurs.  The loop before the return statement just divides each count by the length of the list of words, and there you go: relative term frequency done!  Of course you can do some sorting afterwards to make it look prettier, but that's basically it.
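The makeSnippetWordList helper isn't shown here; a minimal sketch of what it might look like, assuming the snippets come in as one big string, could be as simple as this:

import re

def makeSnippetWordList( snippets ):
    # lower-case everything and keep only runs of letters,
    # so punctuation doesn't stick to the words
    return re.findall( r"[a-z]+", snippets.lower() )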

Now, how do you calculate/implement the inverse document frequency?  Good that you asked!  The inverse document frequency describes how common a term is across your whole document collection.  To calculate anything meaningful, you need at least 2 different documents that are relevant to your analysis, and the more the better.  You then divide your number of documents by the number of documents the term occurred in and take the logarithm of that: idf(term) = log(N / df(term)).  Multiplying that with the relative term frequency gives you the much-wanted TF-IDF values.
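To make the numbers concrete, here is a tiny made-up example of that calculation:

import math

# made-up numbers: a collection of 4 documents,
# and a term that occurs in 2 of them
idf = math.log( 4 / 2 )    # = log(2), about 0.693

# say the term makes up 5% of the words in one document
tf = 0.05
tfidf = tf * idf           # about 0.035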

import math

def idfCorpus( snippets, reviews ):
    wlreviews = makeWordListDict( reviews )
    idf = { }
    # document frequency: in how many reviews does each term occur?
    for tf, terms in snippetTFRank( snippets ):
        for word in terms:
            for reviewnr, review in wlreviews.items():
                if word in review:
                    idf[word] = idf.get( word, 0 ) + 1
    # inverse document frequency: log(number of documents / document frequency)
    for word, df in idf.items():
        idf[word] = math.log( len( wlreviews ) / df )
    return idf

def tfidfCorpus( snippets, reviews ):
    tf = makeSnippetDictionary( snippets )
    idf = idfCorpus( snippets, reviews )
    tfidf = { }
    # TF-IDF is simply the product of the two values per term
    for term, frequency in tf.items():
        tfidf[term] = frequency * idf.get( term, 0 )
    return tfidf

The piece of Python above basically shows you how to do this.  For interpretation there is basically one rule:  the higher the TF-IDF value of a term in a document, the more important the term is to that document.  Bear in mind that the relative term frequency differs per document, so TF-IDF values are computed per document as well: the same term can score high in one document and low in another.
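As a hypothetical usage example (your own loading code has to provide the snippets and reviews), you could rank a snippet's terms like this:

# rank the terms by TF-IDF, highest first, and print the top ten
tfidf = tfidfCorpus( snippets, reviews )
for term, score in sorted( tfidf.items(), key=lambda kv: kv[1], reverse=True )[:10]:
    print( term, score )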

So why would you want to use this?  In my case, I wanted to determine whether certain parts of product reviews (called snippets in the code) are more likely to contain certain words (yes, they are!).  Adding a TF-IDF analysis to your text-processing machine learning algorithm can help improve it!
