You're reading from  Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Author (1)
Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

Chapter 8. Topic Modeling in Text

Topic modeling in text is loosely related to the summarization techniques we explored in Chapter 7, Automatic Text Summarization. However, topic modeling rests on a more complex mathematical foundation and produces a different type of result. The goal of text summarization is to produce a reduced version of a text that still expresses its common themes or concepts, whereas the goal of topic modeling is to expose the underlying concepts themselves.

To extend our Chapter 7, Automatic Text Summarization metaphor, in which text summarization was compared to building a scale model of a house, topic modeling is like trying to describe the purpose of a set of houses based on multiple sample dwellings. For example, the topic model of one neighborhood of houses might be busy family, storage space, and low maintenance and another neighborhood could have houses described with the words social, entertaining, luxury, and showplace. These two models clearly...

What is topic modeling?


Just like with the keyword-based text summarization techniques we looked at in Chapter 7, Automatic Text Summarization, topic modeling also takes into account what words are used in a text. However, the focus of topic modeling is more about themes and concepts, and not solely about summarizing text. Topic models can be used for summarization, but they can also be used for many other goals:

  • Topic models can assist with organization of documents, for example, to group news articles together into a cohesive section

  • Topic models can help us make recommendations about what to read next by finding materials that have a topic list in common

  • Topic models can improve search results by revealing documents that may use a mix of different keywords but are about the same idea

One critical feature of the type of topic modeling we will investigate in this chapter is that the analyst does not need to know what the topics or keywords are in advance. Instead, the model is created in an...

Latent Dirichlet Allocation


The most common technique currently in use for topic modeling of text, and the one that the Facebook researchers used in their 2013 paper, is called Latent Dirichlet Allocation (LDA).

Tip

Many people wonder how to pronounce Dirichlet in English. The most common pronunciation I have heard is DEER-uh-shlay, and I have also heard DEER-uh-klay a few times.

LDA was first proposed for text topic extraction by David Blei, Andrew Ng, and Michael Jordan in a 2003 paper entitled simply Latent Dirichlet Allocation, available from the Journal of Machine Learning Research at http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf. Blei also wrote a good follow-up article in 2012 for the Communications of the ACM about LDA and some new variants and improvements for it. This later article is written in very accessible language and is available for download at https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf.

The first thing we should know about LDA is that it is a probabilistic...
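Pausing on that word probabilistic: in Blei, Ng, and Jordan's formulation, LDA assumes each document was generated by a simple random process. A sketch of that generative story (using their standard notation, which this excerpt does not show) is:

```latex
% For each document d, draw its topic proportions from a Dirichlet prior:
\theta_d \sim \mathrm{Dirichlet}(\alpha)
% For each word position n in document d, draw a topic assignment
% from those proportions:
z_{d,n} \sim \mathrm{Multinomial}(\theta_d)
% Then draw the observed word from the chosen topic's word distribution:
w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
```

Fitting the model reverses this story: given only the observed words w, inference recovers the hidden (latent) topic structure, the per-document proportions theta and the per-word assignments z.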

Gensim for topic modeling


We already used the Gensim library in Chapter 7, Automatic Text Summarization, for extracting keywords and summaries of text. Here we will use it for building a topic model of a collection of texts. Just as we did in earlier chapters, we will practice with a few different types of document collections and see how the results vary.

First, we will build a small test program to make sure that Gensim and LDA are installed correctly and able to generate a topic model from a collection of documents. If Gensim is not loaded into your version of Anaconda, simply run conda install gensim in your terminal.

We begin with importing the Gensim libraries and a PrettyPrinter for formatting:

from gensim import corpora                           # dictionary and corpus tools
from gensim.models.ldamodel import LdaModel          # the LDA implementation
from gensim.parsing.preprocessing import STOPWORDS   # built-in stopword list
import pprint                                        # for readable topic output

We will need some variables to serve as ways of adjusting the model. As we learn how topic modeling works, we will tweak these values to see how the results change...

Gensim LDA for a larger project


Let's learn how the LDA topic modeling process changes when we have a larger set of documents and words to work with. Suppose that instead of just the 78 e-mails from January 2016, we extend the LKML data set to all the e-mails Linus Torvalds has ever sent to the LKML. After cleaning the data to remove missing messages, source code, attachments, Linus' own name used as a signature, and end-of-line characters, we have a single text file containing 22,546 e-mails. This e-mail text file, called lkmlLinusAll.txt, is provided on the GitHub site for this chapter at https://github.com/megansquire/masteringDM/tree/master/ch8.

After reading these into a dictionary, our program reports that there are 26,709 unique tokens. Asking for the same four topics and five words, but only one pass over this large data set, yields the following topic list:

[
    (0, '0.014*people + 0.013*think + 0.011*merge + 0.010*actually + 0.010*like'),
    (1, '0.011*fix...

Summary


We now have a basic understanding of how probabilistic topic modeling works and we have worked to implement one of the most popular tools for performing this analysis on text: the Gensim implementation of Latent Dirichlet Allocation, or LDA. We learned how to write a simple program to implement LDA modeling on a variety of text samples, some with greater success than others. We learned about how the model can be manipulated by changing the input variables, such as the number of topics and the number of passes over the data. We also discovered that topic lists can change over time, and while more data tends to produce a stronger model, it also tends to obscure niche topics that might have been very important for only a moment in time.

In this topic modeling chapter – perhaps even more than in some of the other chapters – our unsupervised learning approach meant that we experienced how our results are truly dependent on the volume, quality, and uniformity of the data we started with...
