Python Text Processing with NLTK: Storing Frequency Distributions in Redis

by Jacob Perkins | November 2010 | Open Source

Natural Language Processing is used everywhere—in search engines, spell checkers, mobile phones, computer games, and even in your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing.

In the previous article by Jacob Perkins, author of the book Python Text Processing with NLTK 2.0 Cookbook, we saw how to use execnet to do parallel and distributed processing with NLTK.

In this article, we will cover:

  • Storing a frequency distribution in Redis
  • Storing a conditional frequency distribution in Redis
  • Storing an ordered dictionary in Redis
  • Distributed word scoring with Redis and execnet


Storing a frequency distribution in Redis

Redis is a data structure server that is one of the more popular NoSQL databases. Among other things, it provides a network accessible database for storing dictionaries (also known as hash maps). Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time.

Most Redis operations are atomic, so it's even possible to have multiple processes write to the FreqDist concurrently.
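
For example, here's a minimal sketch of concurrent counting (assuming a local redis-server, and using the RedisHashFreqDist class we'll build below; the 'shared_fd' name is arbitrary):

from multiprocessing import Process
from redis import Redis
from redisprob import RedisHashFreqDist

def count_foo(n):
    # each process opens its own connection to the shared hash map
    rhfd = RedisHashFreqDist(Redis(), 'shared_fd')
    for i in range(n):
        rhfd.inc('foo')  # hincrby is atomic, so no increments are lost

if __name__ == '__main__':
    procs = [Process(target=count_foo, args=(1000,)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print RedisHashFreqDist(Redis(), 'shared_fd')['foo']  # prints 4000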

Getting ready

For this and subsequent recipes, we need to install both Redis and redis-py. A quick start install guide for Redis is available at http://code.google.com/p/redis/wiki/QuickStart. To use hash maps, you should install at least version 2.0.0 (the latest version as of this writing).

The Redis Python driver redis-py can be installed using pip install redis or easy_install redis. Ensure you install at least version 2.0.0 to use hash maps. The redis-py homepage is at http://github.com/andymccurdy/redis-py/.

Once both are installed and a redis-server process is running, you're ready to go. Let's assume redis-server is running on localhost on port 6379 (the default host and port).
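
To verify the connection from Python, you can ping the server (a quick sanity check using redis-py, assuming the default host and port):

>>> from redis import Redis
>>> r = Redis()
>>> r.ping()
True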

How to do it...

The FreqDist class extends the built-in dict class, making a FreqDist an enhanced dictionary with two additional key methods: inc() and N(). The inc() method takes a single sample argument for the key, along with an optional count keyword argument that defaults to 1, and increments the value at sample by count. N() returns the number of sample outcomes, which is the sum of all the values in the frequency distribution.
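
For reference, here's how those two methods behave on the in-memory FreqDist (a quick illustration, assuming NLTK 2.x where inc() is available):

>>> from nltk.probability import FreqDist
>>> fd = FreqDist()
>>> fd.inc('foo')
>>> fd.inc('bar', count=2)
>>> fd['foo']
1
>>> fd.N()
3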

We can create an API-compatible class on top of Redis by extending a RedisHashMap (that will be explained in the next section), then implementing the inc() and N() methods. Since the FreqDist only stores integers, we also override a few other methods to ensure values are always integers. This RedisHashFreqDist (defined in redisprob.py) uses the hincrby command for the inc() method to increment the sample value by count, and sums all the values in the hash map for the N() method.

from rediscollections import RedisHashMap

class RedisHashFreqDist(RedisHashMap):
    def inc(self, sample, count=1):
        self._r.hincrby(self._name, sample, count)

    def N(self):
        return int(sum(self.values()))

    def __getitem__(self, key):
        return int(RedisHashMap.__getitem__(self, key) or 0)

    def values(self):
        return [int(v) for v in RedisHashMap.values(self)]

    def items(self):
        return [(k, int(v)) for (k, v) in RedisHashMap.items(self)]

We can use this class just like a FreqDist. To instantiate it, we must pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn't clash with any other keys in Redis.

>>> from redis import Redis
>>> from redisprob import RedisHashFreqDist
>>> r = Redis()
>>> rhfd = RedisHashFreqDist(r, 'test')
>>> len(rhfd)
0
>>> rhfd.inc('foo')
>>> rhfd['foo']
1
>>> rhfd.items()
[('foo', 1)]
>>> len(rhfd)
1

The name of the hash map and the sample keys will be encoded to replace whitespace and & characters with _. This is because the Redis protocol uses these characters for communication. It's best if the name and keys don't include whitespace to begin with.
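
For example, here's what the encode_key() function (shown in the next section) does to a key containing whitespace and & characters:

>>> from rediscollections import encode_key
>>> encode_key('foo bar & baz')
'foo_bar_baz'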

How it works...

Most of the work is done in the RedisHashMap class, found in rediscollections.py, which extends collections.MutableMapping, then overrides all methods that require Redis-specific commands. Here's an outline of each method that uses a specific Redis command:

  • __len__(): Uses the hlen command to get the number of elements in the hash map
  • __contains__(): Uses the hexists command to check if an element exists in the hash map
  • __getitem__(): Uses the hget command to get a value from the hash map
  • __setitem__(): Uses the hset command to set a value in the hash map
  • __delitem__(): Uses the hdel command to remove a value from the hash map
  • keys(): Uses the hkeys command to get all the keys in the hash map
  • values(): Uses the hvals command to get all the values in the hash map
  • items(): Uses the hgetall command to get a dictionary containing all the keys and values in the hash map
  • clear(): Uses the delete command to remove the entire hash map from Redis

Extending collections.MutableMapping provides a number of other dict compatible methods based on the previous methods, such as update() and setdefault(), so we don't have to implement them ourselves.
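
For example, update() works out of the box (a quick illustration; note that the raw RedisHashMap returns values as strings, since only the RedisHashFreqDist subclass coerces them to integers; the 'test_map' name is arbitrary):

>>> from redis import Redis
>>> from rediscollections import RedisHashMap
>>> rhm = RedisHashMap(Redis(), 'test_map')
>>> rhm.update({'foo': 1, 'bar': 2})
>>> len(rhm)
2
>>> rhm['foo']
'1'
>>> rhm.clear()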

The initialization used for the RedisHashFreqDist is actually implemented here, and requires a Redis connection and a name for the hash map. The connection and name are both stored internally to use with all the subsequent commands. As mentioned before, whitespace is replaced by underscore in the name and all keys, for compatibility with the Redis network protocol.

import collections, re

white = r'[\s&]+'

def encode_key(key):
    return re.sub(white, '_', key.strip())

class RedisHashMap(collections.MutableMapping):
    def __init__(self, r, name):
        self._r = r
        self._name = encode_key(name)

    def __iter__(self):
        return iter(self.items())

    def __len__(self):
        return self._r.hlen(self._name)

    def __contains__(self, key):
        return self._r.hexists(self._name, encode_key(key))

    def __getitem__(self, key):
        return self._r.hget(self._name, encode_key(key))

    def __setitem__(self, key, val):
        self._r.hset(self._name, encode_key(key), val)

    def __delitem__(self, key):
        self._r.hdel(self._name, encode_key(key))

    def keys(self):
        return self._r.hkeys(self._name)

    def values(self):
        return self._r.hvals(self._name)

    def items(self):
        return self._r.hgetall(self._name).items()

    def get(self, key, default=0):
        return self[key] or default

    def iteritems(self):
        return iter(self)

    def clear(self):
        self._r.delete(self._name)

There's more...

The RedisHashMap can be used by itself as a persistent key-value dictionary. While the hash map can support a large number of keys and arbitrary string values, its storage structure is optimized for integer values and smaller numbers of keys. However, don't let that stop you from taking full advantage of Redis. It's very fast (for a network server) and does its best to efficiently encode whatever data you throw at it.
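
For example, values survive across processes and sessions (a quick sketch; the 'persist_demo' name is arbitrary):

>>> from redis import Redis
>>> from rediscollections import RedisHashMap
>>> RedisHashMap(Redis(), 'persist_demo')['color'] = 'blue'

Later, even from a different process:

>>> RedisHashMap(Redis(), 'persist_demo')['color']
'blue'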

While Redis is quite fast for a network database, it will be significantly slower than the in-memory FreqDist. There's no way around this, but while you sacrifice speed, you gain persistence and the ability to do concurrent processing.

See also

In the next recipe, we'll create a conditional frequency distribution based on the Redis frequency distribution created here.

Storing a conditional frequency distribution in Redis

The nltk.probability.ConditionalFreqDist class is a container for FreqDist instances, with one FreqDist per condition. It is used to count frequencies that are dependent on another condition, such as another word or a class label. Here, we'll create an API-compatible class on top of Redis using the RedisHashFreqDist from the previous recipe.
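
As a reminder of the interface we're replicating, here's the in-memory version (assuming NLTK 2.x):

>>> from nltk.probability import ConditionalFreqDist
>>> cfd = ConditionalFreqDist()
>>> cfd['pos'].inc('great')
>>> cfd['pos']['great']
1
>>> cfd.conditions()
['pos']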

Getting ready

As in the previous recipe, you'll need to have Redis and redis-py installed with an instance of redis-server running.

How to do it...

We define a RedisConditionalHashFreqDist class in redisprob.py that extends nltk.probability.ConditionalFreqDist and overrides two methods: __getitem__(), so we can create an instance of RedisHashFreqDist instead of a FreqDist; and __contains__(), so we can call encode_key() from the rediscollections module before checking whether the RedisHashFreqDist exists.

from nltk.probability import ConditionalFreqDist
from rediscollections import encode_key

class RedisConditionalHashFreqDist(ConditionalFreqDist):
    def __init__(self, r, name, cond_samples=None):
        self._r = r
        self._name = name
        ConditionalFreqDist.__init__(self, cond_samples)
        # initialize self._fdists for all matching keys
        for key in self._r.keys(encode_key('%s:*' % name)):
            condition = key.split(':')[1]
            self[condition]  # calls self.__getitem__(condition)

    def __contains__(self, condition):
        return encode_key(condition) in self._fdists

    def __getitem__(self, condition):
        if condition not in self._fdists:
            key = '%s:%s' % (self._name, condition)
            self._fdists[condition] = RedisHashFreqDist(self._r, key)
        return self._fdists[condition]

    def clear(self):
        for fdist in self._fdists.values():
            fdist.clear()

An instance of this class can be created by passing in a Redis connection and a base name. After that, it works just like a ConditionalFreqDist.

>>> from redis import Redis
>>> from redisprob import RedisConditionalHashFreqDist
>>> r = Redis()
>>> rchfd = RedisConditionalHashFreqDist(r, 'condhash')
>>> rchfd.N()
0
>>> rchfd.conditions()
[]
>>> rchfd['cond1'].inc('foo')
>>> rchfd.N()
1
>>> rchfd['cond1']['foo']
1
>>> rchfd.conditions()
['cond1']
>>> rchfd.clear()

How it works...

The RedisConditionalHashFreqDist uses name prefixes to reference RedisHashFreqDist instances. The name passed in to the RedisConditionalHashFreqDist is a base name that is combined with each condition to create a unique name for each RedisHashFreqDist. For example, if the base name of the RedisConditionalHashFreqDist is 'condhash', and the condition is 'cond1', then the final name for the RedisHashFreqDist is 'condhash:cond1'. This naming pattern is used at initialization to find all the existing hash maps using the keys command. By searching for all keys matching 'condhash:*', we can identify all the existing conditions and create an instance of RedisHashFreqDist for each.

Combining strings with colons is a common naming convention for Redis keys as a way to define namespaces. In our case, each RedisConditionalHashFreqDist instance defines a single namespace of hash maps.
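
You can see this namespacing directly with the keys command (a quick illustration, reusing the 'condhash' base name from the earlier session):

>>> from redis import Redis
>>> from redisprob import RedisConditionalHashFreqDist
>>> r = Redis()
>>> rchfd = RedisConditionalHashFreqDist(r, 'condhash')
>>> rchfd['cond1'].inc('foo')
>>> r.keys('condhash:*')
['condhash:cond1']
>>> rchfd.clear()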

The ConditionalFreqDist class stores an internal dictionary at self._fdists that is a mapping of condition to FreqDist. The RedisConditionalHashFreqDist class still uses self._fdists, but the values are instances of RedisHashFreqDist instead of FreqDist. self._fdists is created when we call ConditionalFreqDist.__init__(), and values are initialized as necessary in the __getitem__() method.

There's more...

RedisConditionalHashFreqDist also defines a clear() method. This is a helper method that calls clear() on all the internal RedisHashFreqDist instances. The clear() method is not defined in ConditionalFreqDist.

See also

The previous recipe covers the RedisHashFreqDist in detail.


Storing an ordered dictionary in Redis

An ordered dictionary is like a normal dict, but the keys are ordered by an ordering function. In the case of Redis, it supports ordered dictionaries whose keys are strings and whose values are floating point scores. This structure can come in handy for cases such as calculating information gain when you want to store all the words and scores for later use.

Getting ready

Again, you'll need Redis and redis-py installed, with an instance of redis-server running.

How to do it...

The RedisOrderedDict class in rediscollections.py extends collections.MutableMapping to get a number of dict-compatible methods for free. It then implements all the key methods that require Redis ordered set (also known as Zset) commands.

class RedisOrderedDict(collections.MutableMapping):
    def __init__(self, r, name):
        self._r = r
        self._name = encode_key(name)

    def __iter__(self):
        return iter(self.items())

    def __len__(self):
        return self._r.zcard(self._name)

    def __getitem__(self, key):
        val = self._r.zscore(self._name, encode_key(key))
        if val is None:
            raise KeyError
        else:
            return val

    def __setitem__(self, key, score):
        self._r.zadd(self._name, encode_key(key), score)

    def __delitem__(self, key):
        self._r.zrem(self._name, encode_key(key))

    def keys(self, start=0, end=-1):
        # use zrevrange to get keys sorted by highest value instead of lowest
        return self._r.zrevrange(self._name, start, end)

    def values(self, start=0, end=-1):
        return [v for (k, v) in self.items(start=start, end=end)]

    def items(self, start=0, end=-1):
        return self._r.zrevrange(self._name, start, end, withscores=True)

    def get(self, key, default=0):
        # __getitem__() raises KeyError for missing keys, so catch it here
        try:
            return self[key]
        except KeyError:
            return default

    def iteritems(self):
        return iter(self)

    def clear(self):
        self._r.delete(self._name)

You can create an instance of RedisOrderedDict by passing in a Redis connection and a unique name.

>>> from redis import Redis
>>> from rediscollections import RedisOrderedDict
>>> r = Redis()
>>> rod = RedisOrderedDict(r, 'test.txt')
>>> rod.get('bar')
0
>>> len(rod)
0
>>> rod['bar'] = 5.2
>>> rod['bar']
5.2000000000000002
>>> len(rod)
1
>>> rod.items()
[('bar', 5.2000000000000002)]
>>> rod.clear()

How it works...

Much of the code may look similar to the RedisHashMap, which is to be expected since they both extend collections.MutableMapping. The main difference here is that the RedisOrderedDict orders keys by floating point values, and so is not suited for arbitrary key-value storage like the RedisHashMap. Here's an outline explaining each key method and how it works with Redis:

  • __len__(): Uses the zcard command to get the number of elements in the ordered set.
  • __getitem__(): Uses the zscore command to get the score of a key, raising KeyError if the key does not exist.
  • __setitem__(): Uses the zadd command to add a key to the ordered set with the given score, or updates the score if the key already exists.
  • __delitem__(): Uses the zrem command to remove a key from the ordered set.
  • keys(): Uses the zrevrange command to get all the keys in the ordered set, sorted by highest score. It takes two optional keyword arguments start and end to more efficiently get a slice of the ordered keys.
  • values(): Extracts all the scores from the items() method.
  • items(): Uses the zrevrange command to get the scores of each key in order to return a list of 2-tuples ordered by highest score. Like keys(), it takes start and end keyword arguments to efficiently get a slice.
  • clear(): Uses the delete command to remove the entire ordered set from Redis.

The default ordering of items in a Redis ordered set is low-to-high, so that the key with the lowest score comes first. This is the same as Python's default list ordering when you call sort() or sorted(), but it's not what we want when it comes to scoring. For storing scores, we expect items to be sorted from high-to-low, which is why keys() and items() use zrevrange instead of zrange.
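
You can see the difference by comparing the raw commands on a small ordered set (a quick sketch; the 'ordering_demo' name is arbitrary):

>>> from redis import Redis
>>> from rediscollections import RedisOrderedDict
>>> r = Redis()
>>> rod = RedisOrderedDict(r, 'ordering_demo')
>>> rod['low'] = 1
>>> rod['high'] = 2
>>> r.zrange('ordering_demo', 0, -1)
['low', 'high']
>>> r.zrevrange('ordering_demo', 0, -1)
['high', 'low']
>>> rod.clear()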

There's more...

As mentioned previously, the keys() and items() methods take optional start and end keyword arguments to get a slice of the results. This makes the RedisOrderedDict optimal for storing scores, then getting the top N keys. Here's a simple example where we assign three word scores, then get the top two:

>>> from redis import Redis
>>> from rediscollections import RedisOrderedDict
>>> r = Redis()
>>> rod = RedisOrderedDict(r, 'scores')
>>> rod['best'] = 10
>>> rod['worst'] = 0.1
>>> rod['middle'] = 5
>>> rod.keys()
['best', 'middle', 'worst']
>>> rod.keys(start=0, end=1)
['best', 'middle']
>>> rod.clear()

See also

The Storing a frequency distribution in Redis recipe introduces Redis and the RedisHashMap.

Distributed word scoring with Redis and execnet

We can use Redis and execnet together to do distributed word scoring. Now that we have Redis, we will calculate the information gain of each word in the movie_reviews corpus using a RedisHashFreqDist and a RedisConditionalHashFreqDist, then store the scores in a RedisOrderedDict. We can use execnet to distribute the counting in order to get better performance out of Redis.

Getting ready

Redis, redis-py, and execnet must be installed, and an instance of redis-server must be running on localhost.

How to do it...

We start by getting a list of (label, words) tuples for each label in the movie_reviews corpus (which only has pos and neg labels). Then we get the word_scores using score_words() from the dist_featx module. word_scores is an instance of RedisOrderedDict, and we can see that the total number of words is 39,764. Using the keys() method, we can then get the top 1000 words, and inspect the top five just to see what they are. Once we have all we want from word_scores, we can delete the keys in Redis, as we no longer need the data.

>>> from dist_featx import score_words
>>> from nltk.corpus import movie_reviews
>>> labels = movie_reviews.categories()
>>> labelled_words = [(l, movie_reviews.words(categories=[l])) for l in labels]
>>> word_scores = score_words(labelled_words)
>>> len(word_scores)
39764
>>> topn_words = word_scores.keys(end=1000)
>>> topn_words[0:5]
['_', 'bad', '?', 'movie', 't']
>>> from redis import Redis
>>> r = Redis()
>>> [r.delete(key) for key in ['word_fd', 'label_word_fd:neg', 'label_word_fd:pos', 'word_scores']]
[True, True, True, True]

The score_words() function in dist_featx can take a while to complete, so expect to wait a couple of minutes. The overhead of using execnet and Redis means it will take significantly longer than a non-distributed in-memory version of the function.

How it works...

The dist_featx.py module contains the score_words() function, which does the following:

  1. Opens gateways and channels, sending initialization data to each.
  2. Sends each (label, words) tuple over a channel for counting.
  3. Sends a done message to each channel, waits for a done reply back, then closes the channels and gateways.
  4. Calculates the score of each word based on the counts and stores in a RedisOrderedDict.

In our case of counting words in the movie_reviews corpus, calling score_words() opens two gateways and channels, one for counting the pos words, and the other for counting the neg words. The communication is as follows:

(Diagram: the communication flow between score_words() and the two word-counting channels)

Once the counting is finished, we can score all the words and store the results. The code itself is as follows:

import itertools, execnet, remote_word_count
from nltk.metrics import BigramAssocMeasures
from redis import Redis
from redisprob import RedisHashFreqDist, RedisConditionalHashFreqDist
from rediscollections import RedisOrderedDict

def score_words(labelled_words, score_fn=BigramAssocMeasures.chi_sq,
                host='localhost', specs=[('popen', 2)]):
    gateways = []
    channels = []

    for spec, count in specs:
        for i in range(count):
            gw = execnet.makegateway(spec)
            gateways.append(gw)
            channel = gw.remote_exec(remote_word_count)
            channel.send((host, 'word_fd', 'label_word_fd'))
            channels.append(channel)

    cyc = itertools.cycle(channels)

    for label, words in labelled_words:
        channel = cyc.next()
        channel.send((label, list(words)))

    for channel in channels:
        channel.send('done')
        assert 'done' == channel.receive()
        channel.waitclose(5)

    for gateway in gateways:
        gateway.exit()

    r = Redis(host)
    fd = RedisHashFreqDist(r, 'word_fd')
    cfd = RedisConditionalHashFreqDist(r, 'label_word_fd')
    word_scores = RedisOrderedDict(r, 'word_scores')
    n_xx = cfd.N()

    for label in cfd.conditions():
        n_xi = cfd[label].N()
        for word, n_ii in cfd[label].iteritems():
            n_ix = fd[word]
            if n_ii and n_ix and n_xi and n_xx:
                score = score_fn(n_ii, (n_ix, n_xi), n_xx)
                word_scores[word] = score

    return word_scores

Note that this scoring method will only be accurate when there are two labels. If there are more than two labels, then word scores for each label should be stored in separate RedisOrderedDict instances, one per label.
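
Here's a hedged sketch of that variation (hypothetical code, not part of the book's dist_featx module), reusing the r, fd, cfd, n_xx, and score_fn variables from the end of score_words():

# one RedisOrderedDict per label, namespaced like the conditional
# frequency distribution's hash maps
label_word_scores = {}
for label in cfd.conditions():
    label_word_scores[label] = RedisOrderedDict(r, 'word_scores:%s' % label)
    n_xi = cfd[label].N()
    for word, n_ii in cfd[label].iteritems():
        n_ix = fd[word]
        if n_ii and n_ix and n_xi and n_xx:
            label_word_scores[label][word] = score_fn(n_ii, (n_ix, n_xi), n_xx)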

The remote_word_count.py module looks as follows:

from redis import Redis
from redisprob import RedisHashFreqDist, RedisConditionalHashFreqDist

if __name__ == '__channelexec__':
    host, fd_name, cfd_name = channel.receive()
    r = Redis(host)
    fd = RedisHashFreqDist(r, fd_name)
    cfd = RedisConditionalHashFreqDist(r, cfd_name)

    for data in channel:
        if data == 'done':
            channel.send('done')
            break

        label, words = data

        for word in words:
            fd.inc(word)
            cfd[label].inc(word)

You'll notice this is not a pure module as it requires being able to import both redis and redisprob. The reason is that instances of RedisHashFreqDist and RedisConditionalHashFreqDist cannot be pickled and sent over the channel. Instead, we send the host name and key names over the channel so we can create the instances in the remote module. Once we have the instances, there are two kinds of data we can receive over the channel:

  1. A done message, which signals that there is no more data coming in over the channel. We reply back with another done message, then exit the loop to close the channel.
  2. A 2-tuple of (label, words), which we then iterate over to increment counts in both the RedisHashFreqDist and RedisConditionalHashFreqDist.

There's more...

In this particular case, it would be faster to compute the scores without using Redis or execnet. However, by using Redis, we can store the scores persistently for later examination and use. Being able to inspect all the word counts and scores manually is a great way to learn about your data, and we can tweak feature extraction without having to re-compute the scores. For example, you could use featx.bag_of_words_in_set() with the top N words from the RedisOrderedDict, where N could be 1,000, 2,000, or whatever number you want. The benefits of execnet also grow with the data: horizontal scalability, whether through execnet or some other method of distributing computation across many nodes, becomes more valuable as the amount of data you need to process increases.
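
For example (a sketch assuming the featx module from earlier in the book, whose bag_of_words_in_set() takes a list of words and a set of allowed words):

>>> from featx import bag_of_words_in_set
>>> top_words = set(word_scores.keys(end=1000))
>>> feats = bag_of_words_in_set(['the', 'movie', 'was', 'bad'], top_words)

feats will then contain only the words that made the cut.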

Summary

In this article, we saw how to use the Redis data structure server to store frequency distributions, conditional frequency distributions, and ordered dictionaries, and how to combine Redis with execnet for distributed word scoring.



About the Author

Jacob Perkins

Jacob Perkins is the author of Packt's Python Text Processing with NLTK 2.0 Cookbook, and a contributor to the Bad Data Handbook. He is the CTO and cofounder of Weotta, a natural-language-based search engine for local entertainment. He created http://text-processing.com, which demos NLTK functionality and provides natural language processing APIs. Jacob also writes about natural language processing and Python programming at http://streamhacker.com, and you can follow him on Twitter at twitter.com/japerk.
