You're reading from The Natural Language Processing Workshop

Product type: Book
Published in: Aug 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800208421
Edition: 1st Edition
Authors (6):
Rohan Chopra

Rohan Chopra graduated from Vellore Institute of Technology with a bachelor's degree in computer science. Rohan has more than 2 years of experience in designing, implementing, and optimizing end-to-end deep neural network systems. His research centers on the use of deep learning to solve computer vision-related problems, and he has hands-on experience working on self-driving cars. He is a data scientist at Absolutdata.

Aniruddha M. Godbole

Aniruddha M. Godbole is a data science consultant with interdisciplinary expertise in computer science, applied statistics, and finance. He has a master's degree in data science from Indiana University, USA, and an MBA in finance from the National Institute of Bank Management, India. He has authored papers in computer science and finance and has been an occasional opinion-page contributor to Mint, a leading business newspaper in India. He has fifteen years of experience.

Nipun Sadvilkar

Nipun Sadvilkar is a senior data scientist at a US healthcare company, where he leads a team of data scientists and subject matter experts to design and build a clinical NLP engine that revamps medical coding workflows, enhances coder efficiency, and accelerates the revenue cycle. He has more than 3 years of experience in building NLP solutions and web-based data science platforms in the areas of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering, with a fair understanding of the business domain. He is a member of the regional and national Python community. He is the author of pySBD, an open-source Python NLP library for sentence segmentation that is recognized by the ExplosionAI (spaCy) and AllenAI (scispaCy) organizations.

Muzaffar Bashir Shah

Muzaffar Bashir Shah is a software developer with vast experience in machine learning, natural language processing (NLP), text analytics, and data science. He holds a master's degree in computer science from the University of Kashmir and currently works at a Bangalore-based startup named Datoin.

Sohom Ghosh

Sohom Ghosh is a passionate data detective with expertise in natural language processing. He has worked extensively in the data science arena with a specialization in deep learning-based text analytics, NLP, and recommendation systems. He has publications in several international conferences and journals.

Dwight Gunning

Dwight Gunning is a data scientist at FINRA, a financial services regulator in the US. He has extensive experience in Python-based machine learning and hands-on experience with the most popular NLP tools, such as NLTK, gensim, and spaCy.

5. Topic Modeling

Overview

This chapter introduces topic modeling, which means using unsupervised machine learning to find "topics" within a given set of documents. You will explore the most common approaches to topic modeling, which are Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and the Hierarchical Dirichlet Process (HDP), and learn the differences between them. You will then practice implementing these approaches in Python and review the common practical challenges in topic modeling. By the end of this chapter, you will be able to create topic models from any given dataset.

Introduction

In the previous chapter, we learned about different ways to collect data from local files and online resources. In this chapter, we will focus on topic modeling, which is an important area within natural language processing. Topic modeling is a simple way to capture the sense of what a document or a collection of documents is about. Note that here, a document is any coherent collection of words, which could be as short as a tweet or as long as an encyclopedia.

Topic modeling may be thought of as a way to automate the manual task of reading given document(s) to write an abstract, which you will then use to map the document(s) to a set of topics. Topic modeling is mostly done using unsupervised learning algorithms that detect topics on their own. Topic-modeling algorithms operate by performing statistical analysis on words or tokens in documents and using those statistics to automatically assign each document to multiple topics. A topic is represented by an arbitrary...

Topic Discovery

The main goal of topic modeling is to find a set of topics that can be used to classify a set of documents. These topics are implicit because we do not know what they are beforehand, and they are unnamed.

The number of topics can vary from around 3 to, say, 400 or even more. Since it is the algorithm that discovers the topics, the number is generally fixed as an input to the algorithm, except in the case of non-parametric models, in which the number of topics is inferred from the text. The discovered topics may not always correspond directly to topics that a human would find meaningful. In practice, the number of topics should be much smaller than the number of documents. In general, the number of topics specified in a parametric model ought to be greater than or equal to the expected number of topics in the text. In other words, one should err on the side of more topics rather than fewer. This is because fewer topics can cause a problem for...

Topic-Modeling Algorithms

Topic-modeling algorithms operate on the following assumptions:

  • Topics contain a set of words.
  • Documents are made up of a set of topics.

Topics can be considered to be weighted collections of words. Beyond these common assumptions, the algorithms diverge in how they go about discovering topics. In the upcoming sections, we will cover three topic-modeling algorithms in detail: LSA, LDA, and HDP. Here, the term latent (the L in LSA and LDA) refers to the fact that the probability distribution of the topics is not directly observable. We can observe the documents and the words but not the topics.
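
To make the two assumptions concrete, here is a minimal sketch that treats a topic as a weighted collection of words and a document as a mixture of topics. It follows the probabilistic reading used by LDA-style models (LSA weights are not probabilities). The topic names, words, and weights are made-up toy values, and word_probability is a hypothetical helper, not part of any library.

```python
# Hypothetical toy data: two topics, each a weighted collection of words.
topics = {
    "topic_0": {"goal": 0.5, "team": 0.3, "match": 0.2},
    "topic_1": {"vote": 0.6, "party": 0.3, "team": 0.1},
}

# A document is a mixture of topics; the proportions sum to 1.
doc_topic_mix = {"topic_0": 0.75, "topic_1": 0.25}


def word_probability(word, doc_mix, topics):
    """Probability of `word` in a document: the sum over topics of
    P(topic | document) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_mix.items()
    )


p_team = word_probability("team", doc_topic_mix, topics)
print(round(p_team, 3))  # 0.75 * 0.3 + 0.25 * 0.1 = 0.25
```

Note that the word "team" belongs to both topics with different weights. A topic-modeling algorithm works in the opposite direction: it observes only the words in the documents and has to estimate both the topic weights and the document mixtures.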

Note

The LDA algorithm builds on ideas from the LSA algorithm; the similar acronyms reflect this association.

Latent Semantic Analysis (LSA)

We will start by looking at LSA. LSA actually predates the World Wide Web. It was first described in 1988. LSA is also known by an alternative name, Latent Semantic Indexing...

Key Input Parameters for LSA Topic Modeling

We will be using the gensim library to perform LSA topic modeling. The key input parameters for gensim's LSA model are the corpus, the number of topics, and id2word. The corpus starts out as a list of documents in which each document is a list of tokens. The id2word parameter refers to a dictionary that is used to convert the corpus from this textual representation to a numeric one, such that each word corresponds to a unique number. Let's do an exercise to understand this concept better.
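
Before turning to the library, the idea behind id2word can be sketched in plain Python. This is only an illustration under our own assumptions: the toy documents and the variable names word2id and bow_corpus are hypothetical, and IDs are assigned in order of first appearance, which need not match the order gensim uses. In gensim itself, corpora.Dictionary and its doc2bow method play this role.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
documents = [
    ["topic", "model", "find", "topic"],
    ["model", "learn", "word"],
]

# Assign each unique word an integer ID in order of first appearance.
word2id = {}
for doc in documents:
    for token in doc:
        word2id.setdefault(token, len(word2id))

# id2word is the reverse mapping, from ID back to word.
id2word = {idx: word for word, idx in word2id.items()}

# Numeric corpus: each document becomes sorted (word_id, count) pairs.
bow_corpus = [
    sorted(Counter(word2id[token] for token in doc).items())
    for doc in documents
]

print(id2word)     # {0: 'topic', 1: 'model', 2: 'find', 3: 'learn', 4: 'word'}
print(bow_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1), (4, 1)]]
```

The model only ever sees the numeric pairs; id2word is what lets us translate the discovered topics back into readable words.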

spaCy is a popular natural language processing library for Python. In our exercises, we will be using spaCy to tokenize the text, lemmatize the tokens, and identify each token's part of speech. We will be using spaCy v2.1.3. After installing spaCy v2.1.3, we will need to download the English language model using the following command so that we can load it (there are models for many different languages).

python -m spacy...

Hierarchical Dirichlet Process (HDP)

HDP is a non-parametric variant of LDA. It is called "non-parametric" because the number of topics is inferred from the data rather than provided by us. This means that the number of topics is learned and can grow with the data (that is, it is theoretically unbounded).
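
One way to build intuition for a number of topics that is inferred rather than fixed is the Chinese restaurant process, a standard metaphor for the Dirichlet process on which HDP is built. The sketch below is not HDP itself, only a single-level illustration under our own assumptions: the function name and the seed argument are hypothetical, and gamma stands for the concentration parameter that controls how readily a new group is opened.

```python
import random


def chinese_restaurant_process(num_customers, gamma, seed=0):
    """Seat customers one by one; return the size of each table.

    A new table is opened with probability gamma / (n + gamma), where n
    is the number of customers already seated; otherwise the customer
    joins an existing table with probability proportional to its size.
    """
    rng = random.Random(seed)
    table_sizes = []
    for n in range(num_customers):
        if rng.random() < gamma / (n + gamma):
            table_sizes.append(1)  # open a new table (a new "topic")
        else:
            # Pick an existing table, weighted by how many sit there.
            chosen = rng.choices(
                range(len(table_sizes)), weights=table_sizes
            )[0]
            table_sizes[chosen] += 1
    return table_sizes


# The number of tables is never fixed in advance; it grows with the data.
for n in (10, 100, 1000):
    print(n, len(chinese_restaurant_process(n, gamma=1.0)))
```

The number of tables is an outcome of the process, not an input to it, which mirrors how the number of topics in HDP depends on the amount and variety of text.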

The tomotopy HDP implementation can infer between 1 and 32,767 topics. gensim's HDP implementation, by contrast, seems to fix the number of topics at 150. For our purposes, we will be using the tomotopy HDP implementation.

The gensim and scikit-learn libraries use variational inference, while the tomotopy library uses collapsed Gibbs sampling. When the time required by collapsed Gibbs sampling is not an issue, it is preferable to variational inference; in other cases, we may prefer variational inference. For the tomotopy library, the following parameters are used:

iter: This refers to the number of iterations that...
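
To give a feel for what each iteration of collapsed Gibbs sampling does, here is a minimal sketch for plain LDA with a fixed number of topics. It is not tomotopy's HDP sampler, and it is far too slow for real corpora. The function name, the parameter names (alpha, eta, iterations), and the toy corpus are our own assumptions for illustration.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, eta=0.01, iterations=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word IDs."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Conditional weight of each topic given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus over a 4-word vocabulary (IDs 0 to 3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
doc_topic, topic_word = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=4)
print(doc_topic)
```

Every iteration revisits every token in the corpus, which is why the number of iterations has such a direct effect on running time.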

Summary

In this chapter, we discussed topic modeling in detail. Without delving into advanced statistics, we reviewed various topic-modeling algorithms (such as LSA, LDA, and HDP) and how they can be used for topic modeling on a given dataset. We explored the challenges involved in topic modeling, how experimentation can help address those challenges, and, finally, broadly discussed the current state-of-the-art approaches to topic modeling.

In the next chapter, we will learn about vector representation of text, which helps us convert text into a numerical format to make it more easily understandable by machines.
