Topic models are a powerful method for grouping documents by their main themes. They allow probabilistic modeling of term frequency in documents. The fitted model can be used to estimate the similarity between documents, as well as between documents and a set of specified keywords, using an additional layer of latent variables referred to as topics (Grün and Hornik, 2011). In essence, a document is assigned to a topic based on the distribution of the words in it, and the other documents in that topic will have roughly the same word frequencies.
In this tutorial, we will look at a useful framework for text mining called topic models, and apply it to the State of the Union addresses.
In building topic models, the number of topics (k) must be determined before running the algorithm. If there is no prior reason to choose a particular number of topics, you can build several models and apply judgment and domain knowledge to make the final selection. There are several methods that fall under topic modeling; we'll look at LDA with Gibbs sampling. The math behind this method is quite complicated, but my intent is to provide an introduction so that you are at least able to describe, in layperson's terms, how the algorithm learns to assign a document to a topic. If you are interested in mastering the math, block out a couple of hours on your calendar and have a go at it. Excellent background material can be found here.
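If you want a data-driven starting point for k, one option is to fit a few candidate models and compare their log-likelihoods. The following is a minimal sketch of my own, not part of the original walkthrough; it assumes the sotu_dtm document-term matrix that we build later in this tutorial:

> # attach topicmodels so its logLik() method is available
> library(topicmodels)
> candidate_k <- c(4, 6, 8, 10)
> fits <- lapply(candidate_k, function(k) {
    # same method and seed as the model below, varying only k
    LDA(sotu_dtm, k = k, method = "Gibbs", control = list(seed = 1965))
  })
> # higher log-likelihood is better, but watch for diminishing returns as k grows
> sapply(fits, function(fit) as.numeric(logLik(fit)))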
This tutorial is an excerpt from the book 'Mastering Machine Learning with R - Third Edition', written by Cory Lesmeister. The book explores expert techniques for solving data analytics challenges and covers machine learning problems that can help you gain insights from complex projects and power up your applications.
LDA, or Latent Dirichlet Allocation, is a generative process in topic modeling, and it iterates to a steady state in the following manner:
1. For each document, randomly assign each word to one of the k topics.
2. For each word, calculate two proportions: the proportion of words in its document currently assigned to each topic, and the proportion of that word's assignments to each topic across all documents.
3. Reassign the word to a topic with probability proportional to the product of those two proportions.
4. Repeat steps 2 and 3 until the topic assignments reach a steady state.
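To make step 3 concrete, here is a toy sketch of the reassignment probability for a single word, with made-up counts; note that the real sampler also includes Dirichlet smoothing hyperparameters, which are omitted here:

> # hypothetical counts: words in this document assigned to each topic
> doc_topic_counts <- c(t1 = 10, t2 = 3, t3 = 1)
> # hypothetical counts: this word's topic assignments across all documents
> word_topic_counts <- c(t1 = 2, t2 = 40, t3 = 5)
> prob <- (doc_topic_counts / sum(doc_topic_counts)) *
    (word_topic_counts / sum(word_topic_counts))
> prob <- prob / sum(prob)            # normalize to a probability distribution
> sample(names(prob), 1, prob = prob) # draw the word's new topic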
LDA assumes that the order of words and documents does not matter. There has been work to relax these assumptions in order to build models of language generation and of how topics evolve over time (known as dynamic topic modeling, or DTM).
We will leave the 19th century behind and look at these recent times of trial and tribulation (1965 through 2016). Looking at this data, I found something interesting and troubling. Let's take a look at the 1970s:
> sotu_meta[185:191, 1:4]
# A tibble: 7 x 4
  president        year years_active party
  <chr>           <int> <chr>        <chr>
1 Richard M. Nixon 1970 1969-1973    Republican
2 Richard M. Nixon 1971 1969-1973    Republican
3 Richard M. Nixon 1972 1969-1973    Republican
4 Richard M. Nixon 1972 1969-1973    Republican
5 Richard M. Nixon 1974 1973-1974    Republican
6 Richard M. Nixon 1974 1973-1974    Republican
7 Gerald R. Ford   1975 1974-1977    Republican
We see there are two 1972 and two 1974 addresses, but none for 1973. What? I went to the Nixon Foundation website, spent about 10 minutes trying to deconflict this, and finally threw my hands in the air and decided to implement a quick fix. Be advised that there are a number of these conflicts to put in order:
> # tag the second address in each duplicated year so every document ID is unique;
> # note that these strings turn the year column into character labels
> sotu_meta[188, 2] <- "1972_2"
> sotu_meta[190, 2] <- "1974_2"
> sotu_meta[157, 2] <- "1945_2"
> sotu_meta[166, 2] <- "1953_2"
> sotu_meta[170, 2] <- "1956_2"
> sotu_meta[176, 2] <- "1961_2"
> sotu_meta[195, 2] <- "1978_2"
> sotu_meta[197, 2] <- "1979_2"
> sotu_meta[199, 2] <- "1980_2"
> sotu_meta[201, 2] <- "1981_2"
An email to the author of this package is in order. I won't bother with that, but feel free to solve the issue yourself.
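If you would rather not hand-edit row numbers, here is a minimal programmatic sketch of the same fix, under the assumption that sotu_meta is loaded; it tags every repeated year with a numeric suffix (and, like the manual fix, turns year into character):

> library(dplyr)
> sotu_meta <- sotu_meta %>%
    group_by(year) %>%
    mutate(n_in_year = row_number()) %>%
    ungroup() %>%
    mutate(year = ifelse(n_in_year == 1,
                         as.character(year),
                         paste0(year, "_", n_in_year))) %>%
    select(-n_in_year)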
With this tragedy behind us, we'll go through tokenizing and removing stop words again for our relevant time frame:
> sotu_meta_recent <- sotu_meta %>%
    dplyr::filter(year > 1964)
> sotu_meta_recent %>%
    tidytext::unnest_tokens(word, text) -> sotu_unnest_recent
> sotu_recent <- sotu_unnest_recent %>%
    dplyr::anti_join(stop_words, by = "word")
As discussed previously, we need to put the data into a document-term matrix (DTM) before building a model. This is done by creating a word count grouped by year, and then passing that to the cast_dtm() function:
> sotu_recent %>%
    dplyr::group_by(year) %>%
    dplyr::count(word) -> lda_words
> sotu_dtm <- tidytext::cast_dtm(lda_words, year, word, n)
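As an optional sanity check (assuming the steps above ran), the DTM should have one row per year and one column per unique term; the DocumentTermMatrix class comes from the tm package:

> dim(sotu_dtm)
> tm::inspect(sotu_dtm[1:5, 1:5]) # peek at one corner of the matrix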
Let's get our model built. I'm going to create six topics using the Gibbs method, with a seed set for reproducibility and verbose output specified; with the default settings, the sampler runs 2,000 iterations:
> sotu_lda <-
    topicmodels::LDA(
      sotu_dtm,
      k = 6,
      method = "Gibbs",
      control = list(seed = 1965, verbose = 1)
    )
> sotu_lda
A LDA_Gibbs topic model with 6 topics.
The algorithm gives each topic a number, and we can see which year is mapped to which topic. I've abbreviated the output to show only the years from 2002 onward:
> topicmodels::topics(sotu_lda)
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
   2    2    2    2    2    2    2    4    4    4    4    4    4    4    4
We see a clear transition from topic 2 to topic 4 at the handover from Bush to Obama. Here is a table of the counts per topic:
> table(topicmodels::topics(sotu_lda))
 1  2  3  4  5  6
 8  7  5 18 14  5
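To see which presidents fall under each topic, we can join the per-year assignments back to the metadata. This is a minimal sketch I've added, assuming sotu_meta_recent from earlier (the year values are character labels after our earlier fix, hence the as.character() guard):

> topic_by_year <- tibble::tibble(
    year  = names(topicmodels::topics(sotu_lda)),
    topic = as.integer(topicmodels::topics(sotu_lda))
  )
> sotu_meta_recent %>%
    dplyr::distinct(president, year) %>%
    dplyr::mutate(year = as.character(year)) %>%
    dplyr::inner_join(topic_by_year, by = "year")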
Topic 4 is the most prevalent, and it is associated with Clinton's term as well. The following output gives us the top five words associated with each topic:
> topicmodels::terms(sotu_lda, 5)
     Topic 1      Topic 2    Topic 3
[1,] "future"     "america"  "administration"
[2,] "tax"        "security" "congress"
[3,] "spending"   "country"  "economic"
[4,] "government" "world"    "legislation"
[5,] "economic"   "iraq"     "energy"
     Topic 4    Topic 5    Topic 6
[1,] "people"   "world"    "federal"
[2,] "american" "people"   "programs"
[3,] "jobs"     "american" "government"
[4,] "america"  "congress" "program"
[5,] "children" "peace"    "act"
This all makes good sense, and topic 2 is spot on for the period. If you drill down further to, say, the top 10, 15, or 20 words, it is even more revealing, but I won't bore you further. What about an application in the tidy ecosystem, along with a visualization? Certainly! We'll first turn the model object into a data frame, capturing in the process the per-topic-per-word probabilities, called beta:
> lda_topics <- tidytext::tidy(sotu_lda, matrix = "beta")
> ap_top_terms <- lda_topics %>%
    dplyr::group_by(topic) %>%
    dplyr::top_n(10, beta) %>%
    dplyr::ungroup() %>%
    dplyr::arrange(topic, -beta)
We can explore that data further or just plot it as follows:
> ap_top_terms %>%
    dplyr::mutate(term = reorder(term, beta)) %>%
    ggplot2::ggplot(ggplot2::aes(term, beta, fill = factor(topic))) +
    ggplot2::geom_col(show.legend = FALSE) +
    ggplot2::facet_wrap(~ topic, scales = "free") +
    ggplot2::coord_flip() +
    ggthemes::theme_economist_white()
The preceding code produces a faceted bar chart of the top 10 words per topic, ranked by their beta probabilities.
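One caveat with the plot: base R's reorder() ranks terms across all topics at once, so the bars within each facet may not sort perfectly. If that bothers you, here is a sketch of a refinement using tidytext's reorder_within() helper:

> ap_top_terms %>%
    dplyr::mutate(term = tidytext::reorder_within(term, beta, topic)) %>%
    ggplot2::ggplot(ggplot2::aes(term, beta, fill = factor(topic))) +
    ggplot2::geom_col(show.legend = FALSE) +
    ggplot2::facet_wrap(~ topic, scales = "free") +
    tidytext::scale_x_reordered() + # strips the within-facet suffixes from labels
    ggplot2::coord_flip()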
Another thing we can do is look at the probability that an address is related to each topic. This is referred to as gamma in the model, and we can pull those values in just like the beta values:
> ap_documents <- tidytext::tidy(sotu_lda, matrix = "gamma")
We now have the probabilities of each address belonging to each topic. Let's look at the 1981 Ronald Reagan values:
> dplyr::filter(ap_documents, document == "1981")
# A tibble: 6 x 3
  document topic  gamma
  <chr>    <int>  <dbl>
1 1981         1 0.286
2 1981         2 0.0163
3 1981         3 0.0923
4 1981         4 0.118
5 1981         5 0.0777
6 1981         6 0.411
Topic 6 wins, but topic 1 is a close second in the topic race. If you think about it, this means that more than six topics might create better separation in the probabilities. However, I'm sticking with six topics in this tutorial for the purpose of demonstration.
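If you want the single most likely topic for every address in one pass, here is a short sketch using the gamma values from above (slice_max() requires dplyr 1.0 or later; top_n(1, gamma) is the older equivalent):

> ap_documents %>%
    dplyr::group_by(document) %>%
    dplyr::slice_max(gamma, n = 1) %>%
    dplyr::ungroup()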
In this tutorial, we looked at topic models in R and applied the framework to the State of the Union addresses. If you want to stay up to date with expert techniques for solving data analytics challenges and explore other machine learning problems in R, be sure to check out the book 'Mastering Machine Learning with R - Third Edition'.