You're reading from  Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Author (1)
Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

Chapter 8. Topic Modeling in Text

Topic modeling in text is loosely related to the summarization techniques we explored in Chapter 7, Automatic Text Summarization. However, topic modeling rests on a more complex mathematical foundation and produces a different type of result. The goal of text summarization is to produce a reduced version of a text that still expresses its common themes or concepts, whereas the goal of topic modeling is to expose the underlying concepts themselves.

To extend our Chapter 7, Automatic Text Summarization metaphor, in which text summarization was compared to building a scale model of a house, topic modeling is like trying to describe the purpose of a set of houses based on multiple sample dwellings. For example, the topic model of one neighborhood of houses might be busy family, storage space, and low maintenance and another neighborhood could have houses described with the words social, entertaining, luxury, and showplace. These two models clearly...

What is topic modeling?


Just like with the keyword-based text summarization techniques we looked at in Chapter 7, Automatic Text Summarization, topic modeling also takes into account what words are used in a text. However, the focus of topic modeling is more about themes and concepts, and not solely about summarizing text. Topic models can be used for summarization, but they can also be used for many other goals:

  • Topic models can assist with organization of documents, for example, to group news articles together into a cohesive section

  • Topic models can help us make recommendations about what to read next by finding materials that have a topic list in common

  • Topic models can improve search results by revealing documents that may use a mix of different keywords but are about the same idea

One critical feature of the type of topic modeling we will investigate in this chapter is that the analyst does not need to know what the topics or keywords are in advance. Instead, the model is created in an...

Latent Dirichlet Allocation


The most common technique currently in use for topic modeling of text, and the one that the Facebook researchers used in their 2013 paper, is called Latent Dirichlet Allocation (LDA).

Tip

Many people wonder how to pronounce Dirichlet in English. The most common pronunciation I have heard is DEER-uh-shlay, and I have also heard DEER-uh-klay a few times.

LDA was first proposed for text topic extraction by David Blei, Andrew Ng, and Michael Jordan in a 2003 paper entitled simply Latent Dirichlet Allocation, available from the Journal of Machine Learning Research at http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf. Blei also wrote a good follow-up article in 2012 for the Communications of the ACM about LDA and some new variants and improvements for it. This later article is written in very accessible language and is available for download at https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf.

The first thing we should know about LDA is that it is a probabilistic...
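Pausing on that word probabilistic: in Blei, Ng, and Jordan's formulation, LDA assumes each document was generated by a simple random process. A sketch of that generative story (using their standard notation, which this excerpt does not show) is:

```latex
% For each document d, draw its topic proportions from a Dirichlet prior:
\theta_d \sim \mathrm{Dirichlet}(\alpha)
% For each word position n in document d, draw a topic assignment
% from those proportions:
z_{d,n} \sim \mathrm{Multinomial}(\theta_d)
% Then draw the observed word from the chosen topic's word distribution:
w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
```

Fitting the model reverses this story: given only the observed words w, inference recovers the hidden (latent) topic structure, the per-document proportions theta and the per-word assignments z.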

Gensim for topic modeling


We already used the Gensim library in Chapter 7, Automatic Text Summarization, for extracting keywords and summaries of text. Here we will use it for building a topic model of a collection of texts. Just as we did in earlier chapters, we will practice with a few different types of document collections and see how the results vary.

First, we will build a small test program to make sure that Gensim and LDA are installed correctly and able to generate a topic model from a collection of documents. If Gensim is not loaded into your version of Anaconda, simply run conda install gensim in your terminal.

We begin with importing the Gensim libraries and a PrettyPrinter for formatting:

from gensim import corpora                           # dictionary and corpus tools
from gensim.models.ldamodel import LdaModel          # the LDA implementation
from gensim.parsing.preprocessing import STOPWORDS   # built-in stopword list
import pprint                                        # for readable topic output

We will need some variables to serve as ways of adjusting the model. As we learn how topic modeling works, we will tweak these values to see how the results change...

Gensim LDA for a larger project


Let's learn how the LDA topic modeling process changes when we have a larger set of documents and words to work with. Suppose that instead of just the 78 e-mails from January 2016, we extend the LKML data set to all the e-mails Linus Torvalds has ever sent to the LKML. After cleaning the data to remove missing messages, source code, attachments, Linus' own name used as a signature, and end-of-line characters, we have a single text file containing 22,546 e-mails. This e-mail text file, called lkmlLinusAll.txt, is provided on the GitHub site for this chapter at https://github.com/megansquire/masteringDM/tree/master/ch8.

After reading these into a dictionary, our program reports that there are 26,709 unique tokens. Asking for the same four topics and five words, but only one pass over this large data set, yields the following topic list:

[
    (0, '0.014*people + 0.013*think + 0.011*merge + 0.010*actually + 0.010*like'),
    (1, '0.011*fix...

Summary


We now have a basic understanding of how probabilistic topic modeling works and we have worked to implement one of the most popular tools for performing this analysis on text: the Gensim implementation of Latent Dirichlet Allocation, or LDA. We learned how to write a simple program to implement LDA modeling on a variety of text samples, some with greater success than others. We learned about how the model can be manipulated by changing the input variables, such as the number of topics and the number of passes over the data. We also discovered that topic lists can change over time, and while more data tends to produce a stronger model, it also tends to obscure niche topics that might have been very important for only a moment in time.

In this topic modeling chapter – perhaps even more than in some of the other chapters – our unsupervised learning approach meant that we experienced how our results are truly dependent on the volume, quality, and uniformity of the data we started with...
