Text as Data Course
Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail

This page builds upon previous tutorials designed to introduce you to extracting and analyzing text-based data from the internet. This tutorial introduces you to the family of text analysis techniques known as topic models, which have become very popular over the past decade. This tutorial assumes basic knowledge about R and other skills described in previous tutorials at the link above.

What is Topic Modeling?

In an earlier tutorial on dictionary-based approaches we discussed the use of word frequency counts to extract meaning from texts. We noted an important limitation of this approach: it assumes that each word has one and only one meaning. A much more reasonable assumption is that words assume different meanings based upon their appearance alongside other words. For example, consider the phrases “running is good for your health” and “running for office is difficult.” The latter sentence has nothing to do with exercise, yet “running” would likely be associated with exercise in many different text analysis dictionaries.
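To see the limitation in action, here is a minimal sketch in base R. The one-word "exercise" dictionary is hypothetical and exists only for illustration; real dictionaries are much larger, but the problem is the same.

```r
# Two sentences that use "running" in different senses
sentences <- c("running is good for your health",
               "running for office is difficult")

# A hypothetical one-word "exercise" dictionary (illustration only)
exercise_dictionary <- c("running")

# Split each sentence into lowercase words
tokens <- strsplit(tolower(sentences), "\\s+")

# Count dictionary matches in each sentence
exercise_hits <- sapply(tokens, function(words) sum(words %in% exercise_dictionary))
exercise_hits
```

Both sentences receive the same count, even though only the first one is about exercise.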

Topic modeling is part of a class of text analysis methods that analyze “bags” or groups of words together, instead of counting them individually, in order to capture how the meaning of words depends on the broader context in which they are used in natural language. Topic modeling is not the only method that does this: cluster analysis, latent semantic analysis, and other techniques have also been used to identify clustering within texts. A lot can be learned from these approaches. Refer to this article for an interesting discussion of cluster analysis for text.

Nevertheless, topic models have two important advantages over simple forms of cluster analysis such as k-means clustering. In k-means clustering, each observation (for our purposes, each document) can be assigned to one, and only one, cluster. Topic models, however, are mixture models. This means that each document is assigned a probability of belonging to each latent theme or “topic,” so a single document can be associated with several topics at once.
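The difference between the two views can be sketched with a small matrix in base R. The probabilities below are made up for illustration and do not come from a fitted model.

```r
# Hypothetical document-topic probabilities for three documents and three topics
doc_topic <- matrix(
  c(0.70, 0.20, 0.10,
    0.05, 0.90, 0.05,
    0.40, 0.35, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("topic", 1:3))
)

# Mixture view: each row is a probability distribution over the topics
rowSums(doc_topic)

# Hard-clustering view: keep only the single most likely topic per document
hard_assignment <- apply(doc_topic, 1, which.max)
hard_assignment
```

Collapsing doc3 to a single topic hides the fact that it is spread almost evenly across all three, which is exactly the information a mixture model preserves.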

The second major difference between topic models and conventional cluster analysis is that they employ more sophisticated iterative Bayesian techniques to determine the probability that each document is associated with a given theme or topic. This means that documents are initially given a random probability of being assigned to topics, but the probabilities are refined with each pass the model makes over the data.

An Example of Topic Modeling

To make this discussion more concrete, let’s look at an example of topic modeling applied to a corpus of articles from the journal Science. This analysis was conducted by David Blei, who was a pioneer in the field of topic modeling.

This figure illustrates how a small chunk of text from a single document was classified via topic modeling. The colored figures on the left of the diagram describe topics identified by the model, and the words in each box are the most frequent words that appear in each topic. These words are also colored within the text in the middle of the picture, and the bar graph to the right describes the mixture of topics identified within this particular document (once again, topic models are mixture models where most documents have some resemblance to each topic, however small or insignificant).

It is important to note that topic models are no substitute for human interpretation of a text. Instead, they are a way of making educated guesses about how words cohere into different latent themes by identifying patterns in the way they co-occur within documents. There is quite a bit of hype about topic models, and many people are somewhat disappointed when they discover they produce uninformative or even unintelligible results. I will discuss the best “use cases” for topic models in additional detail below.

Latent Dirichlet Allocation

The most common form of topic modeling is Latent Dirichlet Allocation or LDA. LDA works as follows:

  1. First, LDA requires the researcher to specify a value of k or the number of topics in the corpus. In practice, this is a very difficult, and consequential, decision. We will discuss procedures that can be used to identify the appropriate value of k in the common scenario where one does not have strong a priori theoretical expectations about the number of latent themes that might exist in a corpus;

  2. Each word that appears in the corpus is randomly assigned to one of the k topics. If you are a stickler for the details, this assignment is not uniformly random: it is governed by a Dirichlet distribution, which generates points on a probability simplex (this simply means that the probabilities assigned across the k topics are non-negative and add up to 1);

  3. Topic assignments for each word are updated in an iterative fashion, taking into account the prevalence of the word across the k topics as well as the prevalence of the topics in the document. Note that these updates rely on counts of words within topics and documents, not on weighting schemes such as the Term Frequency-Inverse Document Frequency metric discussed in a previous tutorial. Topic assignments are updated until a user-specified number of iterations is reached, or until further iterations have little impact on the probabilities assigned to each word in the corpus.
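The second and third steps above can be sketched in base R with a toy collapsed Gibbs sampler. This is a teaching sketch under stated assumptions, not the algorithm the topicmodels package runs by default (its default estimation method is variational EM). The four-document corpus, the prior values alpha and eta, and the 200 iterations are all made up for illustration.

```r
set.seed(321)

# Toy corpus: each document is a vector of word tokens (made up for illustration)
docs <- list(
  c("run", "health", "exercise", "run", "gym"),
  c("election", "office", "vote", "run", "campaign"),
  c("gym", "exercise", "health", "diet"),
  c("vote", "campaign", "election", "office")
)
vocab <- sort(unique(unlist(docs)))
k     <- 2    # number of topics, chosen by the researcher (step 1)
alpha <- 0.1  # assumed Dirichlet prior on document-topic mixtures
eta   <- 0.1  # assumed Dirichlet prior on topic-word distributions

# Step 2: give every word token a random starting topic
word_ids <- lapply(docs, match, table = vocab)
z <- lapply(docs, function(d) sample(k, length(d), replace = TRUE))

# Count tables implied by the starting assignments
doc_topic  <- matrix(0, length(docs), k)   # tokens per topic within each document
topic_word <- matrix(0, k, length(vocab))  # tokens per word within each topic
for (d in seq_along(docs)) {
  for (i in seq_along(docs[[d]])) {
    t <- z[[d]][i]
    w <- word_ids[[d]][i]
    doc_topic[d, t]  <- doc_topic[d, t] + 1
    topic_word[t, w] <- topic_word[t, w] + 1
  }
}

# Step 3: repeatedly resample the topic of each token
for (iter in 1:200) {
  for (d in seq_along(docs)) {
    for (i in seq_along(docs[[d]])) {
      w   <- word_ids[[d]][i]
      old <- z[[d]][i]
      # Remove the token from the counts before resampling it
      doc_topic[d, old]  <- doc_topic[d, old] - 1
      topic_word[old, w] <- topic_word[old, w] - 1
      # Topic prevalence in this document times word prevalence in each topic
      p <- (doc_topic[d, ] + alpha) *
        (topic_word[, w] + eta) / (rowSums(topic_word) + length(vocab) * eta)
      new <- sample(k, 1, prob = p)
      # Record the new assignment and restore the counts
      z[[d]][i] <- new
      doc_topic[d, new]  <- doc_topic[d, new] + 1
      topic_word[new, w] <- topic_word[new, w] + 1
    }
  }
}

# Estimated mixture of topics in each document (each row sums to 1)
theta <- (doc_topic + alpha) / rowSums(doc_topic + alpha)
round(theta, 2)
```

Each row of theta is the estimated mixture of topics for one document. With a corpus this small the results will vary with the seed, so treat the output as a demonstration of the mechanics rather than a meaningful model.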

LDA, and most other forms of topic modeling, produce two types of output. First, one can identify the words that are most frequently associated with each of the k topics specified by the user. Second, LDA produces the probability that each document within the corpus is associated with each of the k topics specified by the user as well. Researchers often then assign each document to the topic it most closely resembles, or set a probability threshold to define the document as containing one or more of the k topics.
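As a rough sketch of these two outputs, the code below uses the topicmodels package and its bundled Associated Press data, both of which are introduced in the next section. The 100-article subset, k = 4, and the 0.3 threshold are arbitrary choices made so the example runs quickly; they are not recommendations.

```r
library(topicmodels)
data("AssociatedPress")

# Fit a small model on the first 100 articles so the example runs quickly
ap_small    <- AssociatedPress[1:100, ]
small_model <- LDA(ap_small, k = 4, control = list(seed = 321))

# Output 1: the five highest-probability terms for each topic
terms(small_model, 5)

# Output 2: a documents-by-topics matrix of probabilities
doc_topic_probs <- posterior(small_model)$topics
dim(doc_topic_probs)

# Option A: assign each document to its single most likely topic
most_likely_topic <- topics(small_model)
table(most_likely_topic)

# Option B: flag every topic whose probability clears a threshold
above_threshold <- doc_topic_probs > 0.3
head(rowSums(above_threshold))
```

Option A forces every document into exactly one topic, whereas Option B allows a document to carry zero, one, or several topics depending on the threshold.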

Running Your First Topic Model

Let’s give it a try. We are going to begin with the topicmodels package, which you may need to install. This package comes with a dataset of 2,246 news articles from the Associated Press that we will use to run our first model.

library(topicmodels)
data("AssociatedPress")

The workhorse function within the topicmodels package is LDA, which performs Latent Dirichlet Allocation. As I described above, the user must specify a value of k, or the number of topics in the corpus, in order to run a topic model. For a dataset as diverse as the Associated Press articles described above, it is very difficult to make an educated guess about the number of topics we might discover, so to get us started, let’s pick an arbitrary number: 10.

AP_topic_model <- LDA(AssociatedPress, k = 10, control = list(seed = 321))

We use the control argument to pass an arbitrary number (321) as a seed for the random assignment of topics to each word in the corpus, which makes the results reproducible. We can use the control argument to specify a number of different options as well, such as the maximum number of iterations that we want our topic model to perform. For your information, it may take a while for the code above to run, since the default setting of the LDA function is to perform a large number of iterations.
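Here is a hedged sketch of capping the number of iterations through the control argument. To keep things simple it switches to the package's Gibbs sampler (method = "Gibbs"), whose control options include iter and burnin; the default method takes a different set of options. The subset size, k, and iteration counts are illustrative assumptions, chosen only to make the example fast.

```r
library(topicmodels)
data("AssociatedPress")

# Use the first 100 articles so the example finishes quickly (arbitrary subset)
ap_small <- AssociatedPress[1:100, ]

# Gibbs sampling with an explicit iteration budget; all values are illustrative
gibbs_model <- LDA(
  ap_small,
  k = 4,
  method = "Gibbs",
  control = list(
    seed   = 321,  # makes the random starting assignments reproducible
    burnin = 50,   # early iterations that are discarded
    iter   = 200   # number of sampling iterations to run
  )
)

# Top five terms per topic from the shortened run
terms(gibbs_model, 5)
```

Fewer iterations mean a faster but potentially less stable model, so a small budget like this one is best reserved for quick experiments.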

The tidytext package that I have discussed at length in previous tutorials has some useful functions for extracting the probability that each word in the corpus is associated with each of the ten topics. For instance, let’s try the following code, which was written by Julia Silge, to produce bar graphs that describe the top terms for each topic:

library(tidytext)
library(dplyr)
library(ggplot2)

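# Extract the per-topic word probabilities (beta) as a tidy data frame
AP_topics <- tidy(AP_topic_model, matrix = "beta")

# Keep the 10 highest-probability terms within each topic
ap_top_terms <- 
  AP_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Draw one horizontal bar chart of top terms per topic
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()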