Topic modeling is a relatively recent and exciting area that originated in the fields of natural language processing and information retrieval, but has since found applications in a number of other domains as well. Many classification problems, such as sentiment analysis, involve assigning a single class to each observation. The key idea in topic modeling is that we can instead assign a mixture of different classes to an observation. Because the field was inspired by information retrieval, we usually think of our observations as documents and our output classes as topics. In many applications this is literally the case, so we will focus on the domain of text documents and their topics, a very natural setting in which to learn about this important model. In particular, we'll focus on a technique known as Latent Dirichlet Allocation (LDA), which is the most prominently used method for topic modeling.
You're reading from Mastering Predictive Analytics with R
In Chapter 8, Probabilistic Graphical Models, we saw how we can use a bag of words as the features of a Naïve Bayes model in order to perform sentiment analysis. There, the specific predictive task was to determine whether a particular movie review expressed a positive or a negative sentiment. We explicitly assumed that each movie review expressed only one possible sentiment. Each of the words used as features (such as bad, good, fun, and so on) had a different likelihood of appearing in a review under each sentiment.
To compute the model's decision, we compute the likelihood of all the words in a particular review under one class, and compare this to the likelihood of those same words having been generated by the other class. We adjust these likelihoods using the prior probability of each class, so that when we know one class is more popular in the training data, we expect to find it more frequently represented...
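The decision rule described above can be sketched in a few lines of R. This is a minimal illustration, not the chapter's original code: the vocabulary, per-class word likelihoods, and class priors below are all made-up toy values chosen for the example.

```r
# Hypothetical class priors and per-class word likelihoods for a toy
# three-word vocabulary (made-up values for illustration only).
log_priors <- log(c(positive = 0.6, negative = 0.4))
word_likelihoods <- rbind(
  positive = c(good = 0.10, fun = 0.08, bad = 0.02),
  negative = c(good = 0.03, fun = 0.02, bad = 0.12)
)

classify_review <- function(words) {
  # Sum the log-likelihoods of the observed words under each class,
  # add the log prior, and pick the class with the higher total score.
  scores <- log_priors + rowSums(log(word_likelihoods[, words, drop = FALSE]))
  names(which.max(scores))
}

classify_review(c("good", "fun"))   # classified as positive
classify_review(c("bad", "bad"))    # classified as negative
```

Working in log space turns the product of word likelihoods into a sum, which avoids numerical underflow when reviews contain many words.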
Latent Dirichlet Allocation (LDA) is the prototypical method for performing topic modeling. Rather unfortunately, the acronym LDA is also used for another method in machine learning, Linear Discriminant Analysis. This latter method is completely different from Latent Dirichlet Allocation and is commonly used as a way to perform dimensionality reduction and classification. Needless to say, we will use LDA to refer to Latent Dirichlet Allocation throughout this book.
Although LDA involves a substantial amount of mathematics, it is worth exploring some of its technical details in order to understand how the model works and the assumptions that it uses. First and foremost, we should learn about the Dirichlet distribution, which lends its name to LDA.
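For reference, a K-dimensional Dirichlet distribution with parameter vector \(\alpha = (\alpha_1, \dots, \alpha_K)\), all \(\alpha_k > 0\), has the density:

```latex
p(\theta \mid \alpha) =
\frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
\prod_{k=1}^{K} \theta_k^{\,\alpha_k - 1},
\qquad \theta_k \ge 0, \quad \sum_{k=1}^{K} \theta_k = 1
```

A draw from this distribution is itself a probability vector over K outcomes, which is exactly why LDA uses Dirichlet priors for the per-document topic proportions and the per-topic word distributions.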
To see how topic models perform on real data, we will look at two data sets containing articles originating from BBC News during the period of 2004-2005. The first data set, which we will refer to as the BBC data set, contains 2,225 articles that have been grouped into five topics. These are business, entertainment, politics, sports, and technology.
The second data set, which we will call the BBCSports data set, contains 737 articles, all about sports. These are also grouped into five categories, according to the type of sport being described: athletics, cricket, football, rugby, and tennis. Our objective will be to see whether we can build topic models for each of these two data sets that group together articles from the same major topic.
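The overall workflow for both data sets can be sketched as follows. This is a sketch under stated assumptions, with a four-document toy corpus standing in for the BBC articles: build a document-term matrix with the tm package, then fit an LDA model with the topicmodels package.

```r
library(tm)
library(topicmodels)

# Toy stand-in for the real articles: two "business" and two "sport" snippets.
docs <- c("stocks market shares bank profit",
          "market bank economy shares growth",
          "match goal striker league cup",
          "league cup goal team season")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Fit a two-topic model on the toy corpus; for the BBC data set we would
# instead use k = 5 and the document-term matrix of all 2,225 articles.
fit <- LDA(dtm, k = 2, control = list(seed = 101))
terms(fit, 3)   # inspect the top three terms per topic
```

Grouping the articles then amounts to checking whether the model's dominant topic per document (via `topics(fit)`) lines up with the known category labels.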
This chapter was devoted to learning about topic models; after sentiment analysis on movie reviews, this was our second foray into working with real-life text data. This time, our predictive task was to classify the topics of news articles on the Web. The primary technique for topic modeling on which we focused was Latent Dirichlet Allocation (LDA), which derives its name from the assumption that the topic and word distributions found inside a document arise from hidden multinomial distributions sampled from Dirichlet priors. We saw that the generative process of sampling words and topics from these multinomial distributions mirrors many of our natural intuitions about this domain; however, it notably fails to account for correlations between the various topics that can co-occur inside a document.
In our experiments with LDA, we saw that there is more than one way to fit an LDA model, and in particular we saw that a method known as Gibbs...