Chapter 9. Advanced Topic Modeling

We saw in the previous chapter the power of topic modeling, and how intuitive a way it can be to understand and explore our data. In this chapter, we will further explore the utility of these topic models, and look at how to create more useful topic models that better encapsulate the topics present in a corpus. Since topic modeling is a way to understand the documents of a corpus, it also means we can analyze documents in ways we have not done before.

In this chapter, we will cover the following topics:

  • Advanced training tips
  • Exploring documents
  • Topic coherence and evaluating topic models
  • Visualizing topic models

Advanced training tips

In Chapter 8, Topic Models, we explored what topic models are, and how to set them up with both Gensim and scikit-learn. But just setting up a topic model isn't sufficient - a poorly trained topic model would not offer us any useful information.

We've already talked about the most important pretraining tip - preprocessing. It should be quite clear by now that garbage in means garbage out, but sometimes, even after making sure that what you're putting in isn't garbage, we still get nonsensical output. In this section, we will briefly discuss what else you can do to polish your results.

It would be wise to revisit Chapter 3, spaCy's Language Model, and Chapter 4, Gensim - Vectorizing Text and Transformations and n-grams, now - they introduce the methods used in preprocessing, which is usually the first advanced training tip given. It is worth...
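To make these tips concrete, the following is a minimal sketch of the kind of Gensim training setup we have in mind; the toy texts, the filter_extremes thresholds, and the hyperparameter values are illustrative assumptions to be tuned on your own corpus, not prescriptions from this book.

    from gensim import corpora, models

    # texts is assumed to be a preprocessed corpus: one list of tokens per document
    texts = [
        ["bank", "river", "water", "flow"],
        ["bank", "loan", "money", "interest"],
        ["water", "flow", "river", "dam"],
    ]

    dictionary = corpora.Dictionary(texts)
    # Drop tokens that are too rare or too frequent - these thresholds are illustrative
    dictionary.filter_extremes(no_below=2, no_above=0.8)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Extra passes and auto-tuned priors often help on small corpora
    lda = models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        passes=10,
        alpha="auto",
        eta="auto",
    )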

Exploring documents

Once we have our topic model of choice set up, we can use it to analyze our corpus and gain some more insight into the nature of our topic models. While it is certainly useful to know what kinds of topics are present in our dataset, to go one step further we should be able to, for example, cluster or classify our documents based on what topics they are made up of.

In our Jupyter notebook example from Chapter 8, Topic Models, let's start by looking at document-topic proportions. What exactly are these? When we were looking at topics in the previous chapter, we were observing topic-word proportions - the probability of certain words appearing in certain topics. We previously mentioned our assumption that documents are generated from topics - by identifying document-topic proportions, we can see exactly how the topics generated the documents.

So, do...
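As a sketch of what this looks like in code (reusing the lda, dictionary, and corpus objects assumed in the earlier training sketch), Gensim's get_document_topics method returns the topic proportions of a single bag-of-words document:

    # Ask the model which topics the first document in the corpus is made up of
    bow = corpus[0]
    doc_topics = lda.get_document_topics(bow, minimum_probability=0.01)

    # doc_topics is a list of (topic_id, proportion) pairs, e.g. [(0, 0.93), (1, 0.07)]
    for topic_id, proportion in doc_topics:
        print(topic_id, round(proportion, 3))

Stacking these per-document proportions into a matrix gives a compact numeric representation of the corpus, which is exactly the kind of representation that clustering or classification algorithms can work on.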

Topic coherence and evaluating topic models

In the previous sections, we spoke extensively about how topic models are, in general, rather qualitative in nature - it is difficult to put a number on how useful a topic model is. Despite this, there is a need to evaluate topic models, and the most popular method out there is topic coherence - luckily for us, Gensim has quite an extensive suite of topic coherence methods for us to try out.

What exactly is topic coherence? Briefly put, it is a measure of how interpretable topics are for human beings. There are multiple coherence measures in the topic modeling literature, and we won't be going through their theory here, but the following links walk you through the theory and intuition, if you are interested:

  1. What is topic coherence? [9]
  2. Exploring the Space of Topic Coherence Measures [10]

The first link is a Gensim blog post...
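As a brief, hedged illustration of that suite (again reusing the lda, texts, and dictionary objects assumed earlier; c_v is just one of the measures Gensim offers):

    from gensim.models import CoherenceModel

    # The c_v measure needs the tokenized texts; u_mass can work from the bag-of-words corpus alone
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())  # higher scores generally mean more human-interpretable topics

Coherence scores computed this way can also be compared across models trained with different numbers of topics, to help pick a reasonable setting.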

Visualizing topic models

As we have said before, the purpose of topic models is to better understand our textual data - and visualizations are one of the best ways to understand and look at our data. There are multiple ways and techniques to visualize topic models - we will focus on the methods implemented in and compatible with Gensim, but as we have done throughout the book, we will also provide links and documentation for the other popular topic modeling visualization tools.

One of the most popular topic modeling visualization libraries is LDAvis - an R library built largely on D3. It has been ported to Python as pyLDAvis, which is just as nifty and is well integrated with Gensim. It is based on the original paper (LDAvis: A method for visualizing and interpreting topics [19]) by Carson Sievert and Kenneth E. Shirley.

The pyLDAvis library is agnostic...
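A minimal usage sketch, assuming the lda, corpus, and dictionary objects from the earlier sketches and a Jupyter notebook environment (note that in newer pyLDAvis releases the Gensim helper module is named pyLDAvis.gensim_models rather than pyLDAvis.gensim):

    import pyLDAvis
    import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer releases

    pyLDAvis.enable_notebook()  # render the interactive visualization inline in Jupyter
    vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
    pyLDAvis.display(vis)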

Summary

With Chapter 8, Topic Models, and Chapter 9, Advanced Topic Modeling, we are now equipped with the tools and knowledge to apply topic models to our textual data. Topic modeling is largely a data exploration tool, but we can also carry out more targeted analysis, such as seeing which topics make up a document, or which words in a document belong to which topic. Gensim gives us the functionality to carry out these tasks quite easily, with its API constructed so that we can access the mathematical information behind topic models without a hassle.

In the next chapter, we will carry out more targeted text analysis tasks, such as clustering and classification. Clustering and classification are machine learning algorithms that are largely used in text analysis to group similar documents together. We will explain the intuition behind these methods as well as...
