The Natural Language Processing Workshop

Product type: Book
Published: Aug 2020
Publisher: Packt
ISBN-13: 9781800208421
Pages: 452
Edition: 1st
Authors (6): Rohan Chopra, Aniruddha M. Godbole, Nipun Sadvilkar, Muzaffar Bashir Shah, Sohom Ghosh, Dwight Gunning

Table of Contents (10 chapters)

Preface
1. Introduction to Natural Language Processing
2. Feature Extraction Methods
3. Developing a Text Classifier
4. Collecting Text Data with Web Scraping and APIs
5. Topic Modeling
6. Vector Representation
7. Text Generation and Summarization
8. Sentiment Analysis
Appendix

5. Topic Modeling

Overview

This chapter introduces topic modeling, which uses unsupervised machine learning to find "topics" within a given set of documents. You will explore the most common approaches to topic modeling, namely Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and the Hierarchical Dirichlet Process (HDP), and learn the differences between them. You will then practice implementing these approaches in Python and review the common practical challenges in topic modeling. By the end of this chapter, you will be able to create topic models from any given dataset.

Introduction

In the previous chapter, we learned about different ways to collect data from local files and online resources. In this chapter, we will focus on topic modeling, an important area within natural language processing. Topic modeling is a simple way to capture the sense of what a document or a collection of documents is about. Note that in this context, a document is any coherent collection of words, which could be as short as a tweet or as long as an encyclopedia.

Topic modeling can be thought of as a way to automate the manual process of reading a document, writing an abstract, and then using that abstract to map the document to a set of topics. Topic modeling is mostly done using unsupervised learning algorithms that detect topics on their own. Topic-modeling algorithms operate by performing statistical analysis on the words or tokens in documents and using those statistics to automatically assign each document to multiple topics. A topic is represented by an arbitrary...

Topic Discovery

The main goal of topic modeling is to find a set of topics that can be used to classify a set of documents. These topics are implicit because we do not know what they are beforehand, and they are unnamed.

The number of topics can vary from around 3 to, say, 400 (or even more). Although it is the algorithm that discovers the topics, their number is generally fixed as an input to the algorithm, except in the case of non-parametric models, in which the number of topics is inferred from the text. These topics may not always correspond directly to topics that a human would find meaningful. In practice, the number of topics should be much smaller than the number of documents. In general, the number of topics specified in a parametric model ought to be greater than or equal to the expected number of topics in the text; in other words, one should err on the side of more topics rather than fewer. This is because fewer topics can cause a problem for...

Topic-Modeling Algorithms

Topic-modeling algorithms operate on the following assumptions:

  • Topics contain a set of words.
  • Documents are made up of a set of topics.

A topic can be considered a weighted collection of words. Beyond these common assumptions, different algorithms diverge in how they go about discovering topics. In the upcoming sections, we will cover three topic-modeling algorithms in detail, namely LSA, LDA, and HDP. Here, the term latent (the L in these acronyms) refers to the fact that the probability distribution of the topics is not directly observable: we can observe the documents and the words, but not the topics.
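
To make these assumptions concrete, here is an invented toy illustration; the words, weights, and names are made up for this example, not output from any model:

    # Invented toy illustration: a topic as a weighted collection of words.
    topics = {
        "topic_0": {"dog": 0.31, "vet": 0.22, "pet": 0.18},        # looks like "pets"
        "topic_1": {"stock": 0.29, "trade": 0.25, "price": 0.20},  # looks like "finance"
    }

    # A document is made up of a set of topics, with weights summing to 1.
    document_topics = {"doc_42": {"topic_0": 0.85, "topic_1": 0.15}}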

Note

The LDA algorithm builds on the LSA algorithm; their similar acronyms are indicative of this association.

Latent Semantic Analysis (LSA)

We will start by looking at LSA. LSA actually predates the World Wide Web. It was first described in 1988. LSA is also known by an alternative name, Latent Semantic Indexing...

Key Input Parameters for LSA Topic Modeling

We will be using the gensim library to perform LSA topic modeling. The key input parameters for gensim are the corpus, the number of topics, and id2word. The corpus is specified as a list of documents in which each document is a list of tokens, converted into a numeric bag-of-words form. The id2word parameter refers to a dictionary that is used to convert the corpus from a textual representation to a numeric one, such that each word corresponds to a unique number. Let's do an exercise to understand this concept better.
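
Before the exercise, the following minimal sketch shows how these parameters fit together in gensim; the three toy documents are invented for illustration:

    # A minimal LSA sketch with gensim; the documents are invented toy data.
    from gensim import corpora
    from gensim.models import LsiModel

    documents = [
        ["cat", "dog", "pet", "vet"],
        ["stock", "market", "trade", "price"],
        ["dog", "vet", "pet", "health"],
    ]

    # id2word: a dictionary mapping each unique token to an integer ID
    id2word = corpora.Dictionary(documents)

    # corpus: each document as a bag-of-words list of (token ID, count) pairs
    corpus = [id2word.doc2bow(doc) for doc in documents]

    # The number of topics is fixed up front because LSA is a parametric model
    lsa_model = LsiModel(corpus=corpus, num_topics=2, id2word=id2word)

    for topic_id, terms in lsa_model.print_topics(num_topics=2):
        print(topic_id, terms)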

spaCy is a popular natural language processing library for Python. In our exercises, we will be using spaCy to tokenize the text, lemmatize the tokens, and determine each token's part of speech. We will be using spaCy v2.1.3. After installing spaCy v2.1.3, we will need to download the English language model using the following code so that we can load this model (since there are models for many different languages):

python -m spacy...
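
Once the English model is downloaded, preprocessing along the following lines produces the token lists that topic models consume. This is a minimal sketch: the model name, the part-of-speech filter, and the example sentence are illustrative assumptions rather than a fixed recipe.

    # A minimal preprocessing sketch with spaCy (assumes the English model
    # en_core_web_sm has been downloaded as shown above).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def preprocess(text):
        # Tokenize, lemmatize, and keep only non-stop-word content words
        doc = nlp(text)
        return [
            token.lemma_.lower()
            for token in doc
            if token.pos_ in {"NOUN", "ADJ", "VERB"} and not token.is_stop
        ]

    print(preprocess("The cats were chasing the neighbour's dog."))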

Hierarchical Dirichlet Process (HDP)

HDP is a non-parametric variant of LDA. It is called "non-parametric" because the number of topics is inferred from the data rather than provided by us. This means that this parameter is learned and can increase (that is, it is theoretically unbounded).

The tomotopy HDP implementation can infer between 1 and 32,767 topics. gensim's HDP implementation appears to fix the number of topics at 150. For our purposes, we will be using the tomotopy HDP implementation.

The gensim and scikit-learn libraries use variational inference, while the tomotopy library uses collapsed Gibbs sampling. When the time required by collapsed Gibbs sampling is not an issue, it is preferable to variational inference; in other cases, we may prefer variational inference. For the tomotopy library, the following parameters are used:

iter: This refers to the number of iterations that...
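
Putting this together, a minimal HDP run with tomotopy might look like the following sketch; the toy documents, the seed, and the choice of 1,000 iterations are illustrative assumptions:

    # A minimal HDP sketch with tomotopy; the documents are invented toy data.
    import tomotopy as tp

    docs = [
        ["cat", "dog", "pet", "vet"],
        ["stock", "market", "trade", "price"],
        ["dog", "vet", "pet", "health"],
    ]

    hdp = tp.HDPModel(seed=42)  # no fixed number of topics is supplied
    for tokens in docs:
        hdp.add_doc(tokens)

    hdp.train(iter=1000)  # iter: the number of collapsed Gibbs sampling iterations

    print("live topics:", hdp.live_k)  # topics currently considered active
    for k in range(hdp.k):
        if hdp.is_live_topic(k):
            print(k, hdp.get_topic_words(k, top_n=5))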

Summary

In this chapter, we discussed topic modeling in detail. Without delving into advanced statistics, we reviewed various topic-modeling algorithms (such as LSA, LDA, and HDP) and how they can be used for topic modeling on a given dataset. We explored the challenges involved in topic modeling, how experimentation can help address those challenges, and, finally, broadly discussed the current state-of-the-art approaches to topic modeling.

In the next chapter, we will learn about vector representation of text, which helps us convert text into a numerical format to make it more easily understandable by machines.
