NLTK for hackers

In this article written by Nitin Hardeniya, author of the book NLTK Essentials, we will learn that "Life is short, we need Python" that's the mantra I follow and truly believe in. As fresh graduates, we learned and worked mostly with C/C++/JAVA. While these languages have amazing features, Python has a charm of its own. The day I started using Python I loved it. I really did. The big coincidence here is that I finally ended up working with Python during my initial projects on the job. I started to love the kind of datastructures, Libraries, and echo system Python has for beginners as well as for an expert programmer.

(For more resources related to this topic, see here.)

Python as a language has advanced very fast and spatially. If you are a Machine learning/ Natural language Processing enthusiast, then Python is 'the' go-to language these days. Python has some amazing ways of dealing with strings. It has a very easy and elegant coding style, and most importantly a long list of open libraries. I can go on and on about Python and my love for it. But here I want to talk about very specifically about NLTK (Natural Language Toolkit), one of the most popular Python libraries for Natural language processing.

NLTK is simply awesome, and in my opinion,it's the best way to learn and implement some of the most complex NLP concepts. NLTK has variety of generic text preprocessing tool, such as Tokenization, Stop word removal, Stemming, and at the same time,has some very NLP-specific tools,such as Part of speech tagging, Chunking, Named Entity recognition, and dependency parsing.

NLTK provides some of the easiest solutions to all the above stages of NLP and that's why it is the most preferred library for any text processing/ text mining application. NLTK not only provides some pretrained models that can be applied directly to your dataset, it also provides ways to customize and build your own taggers, tokenizers, and so on.

NLTK is a big library that has many tools available for an NLP developer. I have provided a cheat-sheet of some of the most common steps and their solutions using NLTK. In our book, NLTK Essentials, I have tried to give you enough information to deal with all these processing steps using NLTK.

To show you the power of NLTK, let's try to develop a very easy application of finding topics in the unstructured text in a word cloud.

Word CloudNLTK

Instead of going further into the theoretical aspects of natural language processing, let's start with a quick dive into NLTK. I am going to start with some basic example use cases of NLTK. There is a good chance that you have already done something similar. First, I will give a typical Python programmer approach and then move on to NLTK for a much more efficient, robust, and clean solution.

We will start analyzing with some example text content:

>>>import urllib2
>>># urllib2 is use to download the html content of the web link
>>>response = urllib2.urlopen('http://python.org/')
>>># You can read the entire content of a file using read() method
>>>html = response.read()
>>>print len(html)
47020

For the current example, I have taken the content from Python's home page: https://www.python.org/.

We don't have any clue about the kind of topics that are discussed in this URL, so let's say that we want to start an exploratory data analysis (EDA). Typically in a text domain, EDA can have many meanings, but will go with a simple case of what kinds of terms dominate the documents. What are the topics? How frequent are they? The process will involve some level of preprocessing we will try to do this in a pure Python wayand then we will do it using NLTK.

Let's start with cleaning the html tags. One way to do this is to select just tokens, including numbers and character. Anybody who has worked with regular expression should be able to convert html string into a list of tokens:

>>># regular expression based split the string
>>>tokens = [tok for tok in html.split()]
>>>print "Total no of tokens :"+ str(len(tokens))
>>># first 100 tokens
>>>print tokens[0:100]
Total no of tokens :2860
['<!doctype', 'html>', '<!--[if', 'lt', 'IE', '7]>', '<html', 'class="no-js', 'ie6', 'lt-ie7',
'lt-ie8', 'lt-ie9">',
 '<![endif]-->', '<!--[if', 'IE', '7]>', '<html', 'class="no-js', 'ie7', 'lt-ie8', 'lt-ie9">',
'<![endif]-->',
 ''type="text/css"', 'media="not', 'print,', 'braille,' ...]

As you can see, there is an excess of html tags and other unwanted characters when we use the preceding method. A cleaner version of the same task will look something like this:

>>>import re
>>># using the split function https://docs.python.org/2/library/re.html
>>>tokens = re.split('\W+',html)
>>>print len(tokens)
>>>print tokens[0:100]
5787
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt',
'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt',
'ie9', 'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The', 'official', 'home', 'of',
'the', 'Python', 'Programming', 'Language', 'meta', 'name', 'apple' ...]

This looks much cleaner now. But still you can do more; I leave it to you to try to remove as much noise as you can. You can still look for word length as a criteria and remove words that have a length one—it will remove elements,such as 7, 8, and so on, which are just noise in this case. Now let's go to NLTK for the same task. There is a function called clean_html() that can do all the work we were looking for:

>>>import nltk
>>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html
>>>clean = nltk.clean_html(html)
>>># clean will have entire string removing all the html noise
>>>tokens = [tok for tok in clean.split()]
>>>print tokens[:100]
['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '&#9660;', 'Close', 'Python', 'PSF', 'Docs',
 'PyPI', 'Jobs', 'Community', '&#9650;', 'The', 'Python', 'Network', '&equiv;', 'Menu', 'Arts',
'Business' ...]

Cool, right? This definitely is much cleaner and easier to do.

No analysis in any EDA can start without distribution. Let's try to get the frequency distribution. First, let's do it the Python way, then I will tell you the NLTK recipe.

>>>import operator
>>>freq_dis={}
>>>for tok in tokens:
>>>    if tok in freq_dis:
>>>        freq_dis[tok]+=1
>>>    else:
>>>        freq_dis[tok]=1
>>># We want to sort this dictionary on values ( freq in this case )
>>>sorted_freq_dist= sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)
>>> print sorted_freq_dist[:25]
[('Python', 55), ('>>>', 23), ('and', 21), ('to', 18), (',', 18), ('the', 14), ('of', 13), ('for', 12),
 ('a', 11), ('Events', 11), ('News', 11), ('is', 10), ('2014-', 10), ('More', 9), ('#', 9), ('3', 9),
 ('=', 8), ('in', 8), ('with', 8), ('Community', 7), ('The', 7), ('Docs', 6), ('Software', 6), (':', 6),
 ('3:', 5), ('that', 5), ('sum', 5)]

Naturally, as this is Python's home page, Python and the >>> interpreters are the most common terms, also giving a sense about the website.

A better and efficient approach is to use NLTK's FreqDist() function. For this, we will take a look at the same code we developed before:

>>>import nltk
>>>Freq_dist_nltk=nltk.FreqDist(tokens)
>>>print Freq_dist_nltk
>>>for k,v in Freq_dist_nltk.items():
>>>    print str(k)+':'+str(v)
<FreqDist: 'Python': 55, '>>>': 23, 'and': 21, ',': 18, 'to': 18, 'the': 14,
 'of': 13, 'for': 12, 'Events': 11, 'News': 11, ...>
Python:55
>>>:23
and:21
,:18
to:18
the:14
of:13
for:12
Events:11
News:11

Let's now do some more funky things. Let's plot this:

>>>Freq_dist_nltk.plot(50, cumulative=False)
>>># below is the plot for the frequency distributions

We can see that the cumulative frequency is growing, and at words such as other and frequency 400, the curve is going into long tail. Still, there is some noise, and there are words such asthe, of, for, and =. These are useless words, and there is a terminology for these words. These words are stop words,such asthe, a, and an. Article pronouns are generally present in most of the documents; hence, they are not discriminative enough to be informative. In most of the NLP and information retrieval tasks, people generally remove stop words. Let's go back again to our running example:

>>>stopwords=[word.strip().lower() for word in open("PATH/english.stop.txt")]
>>>clean_tokens=[tok for tok in tokens if len(tok.lower())>1 and (tok.lower() not in stopwords)]
>>>Freq_dist_nltk=nltk.FreqDist(clean_tokens)
>>>Freq_dist_nltk.plot(50, cumulative=False)

This looks much cleaner now! After finishing this much, you should be able to get something like this using word cloud:

Please go to http://www.wordle.net/advanced for more word clouds.

Summary

To summarize, this article was intended to give you a brief introduction toNatural Language Processing. The book does assume some background in NLP andprogramming in Python, but we have tried to give a very quick head start to Pythonand NLP.

Resources for Article:


Further resources on this subject:


You've been reading an excerpt of:

NLTK Essentials

Explore Title