Text as Data Course
Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
This page builds upon previous tutorials designed to introduce you to extracting and analyzing text-based data from the internet. This tutorial introduces you to the family of text analysis techniques known as topic models, which have become very popular over the past decade. This tutorial assumes basic knowledge about R and other skills described in previous tutorials at the link above.
In an earlier tutorial on dictionary-based approaches we discussed the use of word frequency counts to extract meaning from texts. We noted an important limitation of this approach: it assumes that each word has one and only one meaning. A much more reasonable assumption is that words take on different meanings based upon their appearance alongside other words. For example, consider the phrases “running is good for your health” and “running for office is difficult.” The latter sentence has nothing to do with exercise, yet “running” would likely be associated with exercise in many different text analysis dictionaries.
Topic modeling is part of a class of text analysis methods that analyze “bags” or groups of words together, instead of counting them individually, in order to capture how the meaning of words depends upon the broader context in which they are used in natural language. Topic modeling is not the only method that does this: cluster analysis, latent semantic analysis, and other techniques have also been used to identify clustering within texts. A lot can be learned from these approaches. Refer to this article for an interesting discussion of cluster analysis for text.
Nevertheless, topic models have two important advantages over simple forms of cluster analysis such as k-means clustering. In k-means clustering, each observation—for our purposes, each document—can be assigned to one, and only one, cluster. Topic models, however, are mixture models. This means that each document is assigned a probability of belonging to a latent theme or “topic.”
The second major difference between topic models and conventional cluster analysis is that they employ more sophisticated iterative Bayesian techniques to determine the probability that each document is associated with a given theme or topic. This means that documents are initially given a random probability of being assigned to topics, but the probabilities become increasingly accurate as more data are processed.
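To see the first difference in practice, here is a minimal sketch using a small toy document-term matrix (hypothetical counts, not data from this tutorial): k-means returns exactly one cluster label per document, whereas a mixture model such as LDA returns a full vector of topic probabilities for each document.
set.seed(1)
# five hypothetical documents described by counts of four terms
dtm_toy <- matrix(rpois(20, lambda = 3), nrow = 5,
                  dimnames = list(paste0("doc", 1:5), paste0("term", 1:4)))
# k-means: each document receives exactly one cluster label
kmeans(dtm_toy, centers = 2)$cluster
# a mixture model such as LDA would instead return, for every document, a vector of
# topic probabilities that sums to 1 (e.g., 0.7 for topic 1 and 0.3 for topic 2)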
An example of topic modeling
To make this discussion more concrete, let’s look at an example of topic modeling applied to a corpus of articles from the journal Science. This analysis was conducted by David Blei, who was a pioneer in the field of topic modeling.
This figure illustrates how a small chunk of text from a single document was classified via topic modeling. The colored figures on the left of the diagram describe topics identified by the model, and the words in each box describe the most frequent words that appear in each topic. These words are also colored within the text in the middle of the picture, and the bar graph to the right describes the mixture of topics identified within this particular document (once again, topic models are mixture models where most documents have some resemblance to each topic, however small or insignificant).
It is important to note that topic models are no substitute for human interpretation of a text; instead, they are a way of making educated guesses about how words cohere into different latent themes by identifying patterns in the way they co-occur within documents. There is quite a bit of hype about topic models, and many people are somewhat disappointed when they discover they produce uninformative or even unintelligible results. I will discuss the best “use cases” for topic models in additional detail below.
Latent Dirichlet Allocation
The most common form of topic modeling is Latent Dirichlet Allocation or LDA. LDA works as follows:
First, LDA requires the researcher to specify a value of k, or the number of topics in the corpus. In practice, this is a very difficult (and consequential) decision. We will discuss procedures that can be used to identify an appropriate value of k in the common scenario where one does not have strong a priori theoretical expectations about the number of latent themes that might exist in a corpus;
Each word that appears in the corpus is randomly assigned to one of the k topics. If you are a stickler for the details, this assignment is technically not uniformly random, since it involves a Dirichlet distribution, which is defined over a probability simplex rather than over arbitrary real numbers (this simply means that the proportions assigned across the k topics add up to 1; see the short sketch after this list);
Topic assignments for each word are updated in an iterative fashion: the model updates the prevalence of the word across the k topics, as well as the prevalence of the topics in the document. This stage of LDA employs the Term Frequency-Inverse Document Frequency metric discussed in a previous tutorial. Topic assignments continue to be updated until a user-specified maximum number of iterations is reached, or until additional iterations have little impact on the probabilities assigned to each word in the corpus.
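To make the probability simplex idea concrete, here is a minimal sketch (not part of the tutorial's code) that draws one set of topic proportions for a document with k = 5 topics. Normalizing independent gamma draws is one standard way to sample from a Dirichlet distribution, and the resulting proportions always sum to 1.
set.seed(123)
k <- 5
draws <- rgamma(k, shape = 0.1)   # small shape values tend to yield sparse mixtures
props <- draws / sum(draws)       # normalized draws lie on the probability simplex
round(props, 3)
sum(props)                        # always equals 1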
LDA, and most other forms of topic modeling, produce two types of output. First, one can identify the words that are most strongly associated with each of the k topics specified by the user. Second, LDA produces, for each document in the corpus, the probability that it is associated with each of the k topics. Researchers often then assign each document to the topic it most closely resembles, or set a probability threshold to define a document as containing one or more of the k topics.
Let’s give it a try. We are going to begin with the topicmodels package, which you may need to install. This package comes with a dataset of 2,246 news articles from the Associated Press that we will use to run our first model.
library(topicmodels)
data("AssociatedPress")
The workhorse function within the topicmodels package is LDA, which performs Latent Dirichlet Allocation. As I described above, the user must specify a value of k, or the number of topics in the corpus, in order to run a topic model. For a dataset as diverse as the Associated Press articles described above, it is very difficult to make an educated guess about the number of topics we might discover, so to get us started, let’s pick a somewhat arbitrary number: 10.
AP_topic_model <- LDA(AssociatedPress, k = 10, control = list(seed = 321))
We use the control argument to pass a seed value (321) so that the random assignment of topics to each word in the corpus is reproducible. We can use the control argument to specify a number of other options as well, such as the maximum number of iterations that we want our topic model to perform. Be aware that the code above may take a while to run, since the default settings of the LDA function perform a large number of iterations.
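Before turning to tidytext, note that the topicmodels package itself offers quick ways to inspect a fitted model. The snippet below (a brief sketch, which assumes the AP_topic_model object above has finished fitting) lists the most probable terms in each topic and the single most likely topic for the first few documents.
terms(AP_topic_model, 10)      # ten most probable terms for each topic
head(topics(AP_topic_model))   # most likely topic for the first few documents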
The tidytext package that I have discussed at length in previous tutorials has some useful functions for extracting the probability that each word in the corpus is associated with each of the ten topics. For instance, let’s try the following code, written by Julia Silge, to produce bar graphs that describe the top terms for each topic:
library(tidytext)
library(dplyr)
library(ggplot2)
# extract the per-topic per-word probabilities ("beta") in tidy format
AP_topics <- tidy(AP_topic_model, matrix = "beta")

# keep the ten highest-probability terms within each topic
ap_top_terms <-
  AP_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# plot the top terms for each topic as faceted bar charts
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
Let’s take a moment to review the results. There is some evidence that the words associated with some of the topics cohere into something that could be called a latent theme or topic within the text. For example, topic #1 includes words that are often used to discuss economics and markets, and topic #9 appears to describe legal issues. But there are many other topics that don’t make much sense. Consider, for example, topic #8, which includes the terms “worker,” “two,” “new,” “water,” and “area,” among others.
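Because topic models are mixture models, we can also inspect the other half of the output described earlier: the probability that each document belongs to each topic (the “gamma” matrix in tidytext). The sketch below, which assumes the objects created above, extracts these probabilities and then assigns each document to the topic it most closely resembles.
# per-document topic probabilities ("gamma")
AP_documents <- tidy(AP_topic_model, matrix = "gamma")

# assign each document to its single most probable topic
AP_doc_topics <- AP_documents %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()

head(AP_doc_topics)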
Now YOU try it
Pick any of the datasets we’ve collected thus far (or one from the list of crowd-sourced datasets in the first lecture: http://bit.ly/1JA1CF3);
Prepare the data so that it can be analyzed in the topicmodels package (see the sketch after this list);
Run three models and try to identify an appropriate value for k (the number of topics).
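To prepare your own data for the topicmodels package, one possible workflow (a sketch only, assuming a hypothetical data frame called my_data with an id column and a text column) is to tokenize with tidytext, remove stop words, count words per document, and cast the result to the document-term matrix that LDA expects:
library(dplyr)
library(tidytext)
library(topicmodels)

my_dtm <- my_data %>%                      # my_data is a hypothetical data frame
  unnest_tokens(word, text) %>%            # one row per word ("text" is the text column)
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(id, word) %>%                      # word counts per document ("id" is the document id)
  cast_dtm(id, word, n)                    # document-term matrix for topicmodels

my_first_model <- LDA(my_dtm, k = 10, control = list(seed = 321))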
Reading Tea Leaves
As is perhaps clear, this type of post-hoc interpretation of topic models is rather dangerous. As Jonathan Chang has written, it can quickly come to resemble the process of “reading tea leaves,” or finding meaning in patterns that are in fact quite arbitrary or even random. One of the most consequential decisions in the process of topic modeling, as I mentioned above, is specifying the number of topics (k) in the corpus. Above, we used a purely arbitrary number of topics. Though in some cases researchers may have reasonable guesses about the expected number of topics, in many cases they will not. We will review some techniques for making a reasonable guess at the value of k when the researcher does not have strong a priori expectations about the number of topics, but all of these techniques are imperfect.
Indeed, I once heard David Mimno, who was one of the pioneers in topic modeling, describe the method as a “tool for reading.” By this he meant that topic modeling does not reveal the “true” meaning of documents within a corpus, but is instead a powerful tool for identifying general trends in a corpus that can then be analyzed in a more granular manner using other techniques. Despite this rather humble assessment of the promise of topic models, many people continue to employ them as if they do in fact reveal the true meaning of texts, which I fear may create a surge in “false positive” findings in studies that employ topic models.
LDA is but one of many different types of topic modeling. Though LDA is perhaps the most common form, a number of related techniques now exist, including Dynamic Topic Models, Correlated Topic Models, Hierarchical Topic Models, and so on. One technique that has become especially popular in recent years, however, is Structural Topic Modeling, or STM. STM is very similar to LDA, but it employs metadata about documents (such as the name of the author or the date on which the document was produced) to improve the assignment of words to latent topics in a corpus. For a more detailed discussion of the technical implementation of STM, see this paper, which analyzes the same dataset we will employ below.
Another major advantage of STM is that there is a very high quality R package that implements the method, called stm. This package is not only useful for performing STM, but also for validating topic models, determining an appropriate value of k, and visualizing or further interpreting topic models. It even includes a handy function for pre-processing text. Let’s take a look at an overview of the methods in the stm package produced by the package’s authors:
Let’s work with some new data: a .csv file that describes 13,254 posts on six political blogs from 2008, which is used in the stm package vignette. These data were collected by Eisenstein and Xing. You can download this large .csv file as follows:
google_doc_id <- "1LcX-JnpGB0lU1iDnXnxB6WFqBywUKpew" # google file ID
poliblogs<-read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", google_doc_id), stringsAsFactors = FALSE)
If you browse this dataframe, you’ll see that it not only includes the text of the blog posts, but also the name of the blog, the day of the year on which the post was published, and a “conservative/liberal” rating for each blog. We will use these variables later to demonstrate the power of metadata for topic modeling.
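Before going further, it is worth confirming which variables the data frame actually contains, since we will rely on them below. A quick check might look like this (a sketch; the exact column names depend on the downloaded file):
str(poliblogs)            # column names and types
table(poliblogs$rating)   # number of liberal vs. conservative posts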
Before we get into structural topic modeling, let’s try out the stm package’s text pre-processing functions. The textProcessor function automatically a) removes punctuation; b) removes stop words; c) removes numbers; and d) stems each word. If you need a refresher on why these steps are important, see my previous tutorial entitled “Basic Text Analysis.” The function requires us to specify the part of the dataframe that contains the documents we want to analyze (ours is called documents), and it also requires us to name the dataset where the rest of the metadata live (poliblogs).
library(stm)
## stm v1.3.3 (2018-1-26) successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
processed <- textProcessor(poliblogs$documents, metadata = poliblogs)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Somewhat unusually, the stm package also requires us to store the documents, the metadata, and the “vocab” (the complete list of words used across the documents) in separate objects (see the code below). The first line of code also drops extremely infrequent terms (and can optionally drop extremely common ones), as is common practice in topic modeling, since such terms make word-topic assignment much more difficult.
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
## Removing 83198 of 123990 terms (83198 of 2298953 tokens) due to frequency
## Your corpus now has 13246 documents, 40792 terms and 2215755 tokens.
docs <- out$documents
vocab <- out$vocab
meta <-out$meta
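If you want more control over which rare terms are dropped, prepDocuments has a lower.thresh argument, and the stm package’s plotRemoved function shows how many words, documents, and tokens would be removed at different thresholds before you commit to one. A brief sketch, using the processed object created above:
# how many terms/documents/tokens would be dropped at different minimum-frequency thresholds?
plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))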
Before we run our first model, we have to make another decision about the number of topics we might expect to find in the corpus. Let’s start out with 10. We also need to specify how we want to use the metadata. This model uses both the “rating” variable (which describes whether the blog is liberal or conservative) and the day variable to improve topic classification. Note that the choice of variables used at this stage can be very consequential: in this case, we may fail to identify certain topics that appear on both liberal and conservative blogs (or wrongly conclude that they are separate issues).
Before we run the model, readers should also note that the stm function has an argument (init.type) that allows one to specify the type of initialization or randomization that should be used. In this case we are using spectral initialization, which has several advantages over random initialization that are discussed in the paper linked above.
First_STM <- stm(documents = out$documents, vocab = out$vocab,
K = 10, prevalence =~ rating + s(day) ,
max.em.its = 75, data = out$meta,
init.type = "Spectral", verbose = FALSE)
You may notice that this code takes quite a while to run, depending upon your machine. Once again, we can begin to inspect our results by browsing the top words associated with each topic. The stm package has a useful function that visualizes these results called plot:
plot(First_STM)
This visualization describes the prevalence of each topic within the entire corpus, as well as the top three words associated with each topic. As in our earlier example, you may see that some topics seem plausible, but many others do not seem very coherent or meaningful. The stm package has another useful function called findThoughts, which extracts passages from documents within the corpus that load heavily on the topics specified by the user.
findThoughts(First_STM, texts = poliblogs$documents,
n = 2, topics = 3)
##
## Topic 3:
## So what will happen tomorrow in the Indiana and North Carolina primaries? If you look at the totality of all the polls thus far, they suggest that in terms of both the pledged del count and popular vote, an Obama win in North Carolina will be nearly canceled out by a corresponding Hillary victory in Indiana -- meaning the whole night could end up being more or less a wash. This isn't to say that Obama won't come out on top. If the current average of polls holds up, he'll likely net gains in both the pledged del and popular vote count, thanks to North Carolina's greater overall population. But if you project out the size of the victories based on current polls, it looks as if his net gains in both categories could be negligible. The polls have shown Hillary's overall lead in Indiana to be in the mid-single digits. But some polls suggest that late deciders will break her way, and precedent suggests the same, given that this happened in both Ohio and Pennsylvania. So let's assume a 55%-45% win for Hillary in Indiana. Meanwhile, Obama leads in North Carolina by almost ten, so let's assume Obama will win that state by a similar margin of 10 points. If that happens, our calculations suggest that Obama will walk away with a slight edge for the night of roughly 95 pledged delegates to Hillary's 92. That's because Indiana has 72 delegates, which would roughly break down to 40 for Hillary and 32 for Obama. And North Carolina has 115 delegates, which would roughly break down to 63 for Obama and 52 for Hillary. What about the popular vote? Going by Chuck Todd's projections for the total popular votes in these two states, Obama -- assuming a roughly 10-point margin in both states -- would gain a net popular vote victory for the night of about 60,000 votes. In short, the night could end with little change in the delegate or popular vote margins between the two Dems. This would allow Hillary to argue that the contest should continue, and enable her to keep sowing doubts with the super-dels by asking why he can't "close the deal." But this would also bring Obama that much closer to the nomination by making it that much harder for Hillary to ever reach 2,025 delegates -- while bringing himself closer to that magic number. The latest polls, along with a poll-of-polls bottom line, after the jump. IndianaPollster.com Average: Clinton 49.5%, Obama 43.3%SurveyUSA: Clinton 54%, Obama 42%ARG: Clinton 53%, Obama 45%Suffolk: Clinton 49%, Obama 43%Zogby: Obama 44%, Clinton 42%InsiderAdvantage: Clinton 47%, Obama 40%North CarolinaPollster.com Average: Obama 50.1%, Clinton 41.5%InsiderAdvantage: Obama 48%, Clinton 45%PPP (D): Obama 53%, Clinton 43%ARG: Obama 50%, Clinton 42%Zogby: Obama 48%, Clinton 40%Rasmussen: Obama 49%, Clinton 40%
## A surge by Barack Obama has allowed him to leap ahead or pull even in several states he trailed badly in last week which means that it is likely neither he or Hillary Clinton will come out of Super Tuesday with a large lead in delegates.According to published polls and analysts in the know, Obama has an edge in Idaho, Colorado, Minnesota, Kansas, Alabama, Georgia, North Dakota and Illinois while Hillary Clinton is ahead in New York, New Jersey, Tennessee, New Mexico, Oklahoma and Arkansas.Up for grabs are California, Connecticut, Democrats Abroad, Arizona, Missouri, Delaware, Utah, American Samoa, Alaska, Massachusetts. Of those states, Hillary is ahead in Massachusetts with Obama closing while Obama has surged ahead in California but not by much. Clinton is also ahead in Connecticut and Missouri but by less than 5 points.Obama has been surging in the national polls as well, drawing within the margin of era in most of the daily tracking polls. This could mean that a few of those toss up states may end up going to Obama.As far as delegate totals, most anaylsts expect only a 50-75 delegate edge to either candidate come Wednesday morning. And this could mean - probably means - that the Democratic race will be decided by Superdelegates; those 750 elected officials who will be attending the convention and can pledge any candidate regardless of how their home state voted.Chris Bowers, a savvy Democratic strategists lays out the scenario:[Q]uick math shows that after Super Tuesday, only 1,428 pledged delegates will still be available. Now, here is where the problem shows up. According to current polling averages, the largest possible victory for either candidate on Super Tuesday will be Clinton 889 pledged delegates, to 799 pledged delegates for Obama. (In all likelihood, the winning margin will be lower than this, but using these numbers helps emphasize the seriousness of the situation.) As such, the largest possible pledged delegate margin Clinton can have after Super Tuesday is 937 to 862. (While it is possible Obama will lead in pledged delegates after Super Tuesday, it does not currently seem possible for Obama to have a larger lead than 75). That leaves Clinton 1,088 pledged delegates from clinching the nomination, with only 1,428 pledged delegates remaining. Thus, in order to win the nomination without the aid of super delegates, in her best-case scenario after Super Tuesday, Clinton would need to win 76.2% of all remaining pledged delegates. Given our proportional delegate system, there is simply no way that is going to happen unless Obama drops out.Clinton already has a large, unofficial lead in Superdelegates with at least 184 pledged to her candidacy already to Obama's 95. As the establishment candidate, Clinton can be expected to scoop up a large percentage of the remaining Supers - unless Obama can prove he would be a stronger candidate against McCain in the general election. In this case, both candidates poll within the margin of error against McCain so no advantage to either can be seen.A likely scenario would be the Clinton machine being able to rack up enough endorsements to put Hillary over the top within a couple of weeks of the last Democratic primary in June.
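Another way to inspect the model (complementary to plot and findThoughts) is the stm package’s labelTopics function, which lists several kinds of top words for each topic, including FREX words that are both frequent within and exclusive to a topic. For example, for the same topics we will examine further below:
# top words for topics 3, 5, and 9 under several weighting schemes
labelTopics(First_STM, topics = c(3, 5, 9))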
Choosing a value for k
The stm package has a useful function called searchK, which allows the user to specify a range of values for k, runs an STM model for each value of k, and then outputs several goodness-of-fit measures that are very useful in identifying a range of values of k that provide the best fit for the data. The syntax of this function is very similar to the stm function, except that the user specifies a range of values of k as one of the arguments. In the code below, we search all values of k between 10 and 30.
findingk <- searchK(out$documents, out$vocab, K = c(10:30),
prevalence =~ rating + s(day), data = meta, verbose=FALSE)
plot(findingk)
The plot call above then displays the various fit measures for each value of k.
Once again, readers should note that these measures are very imperfect and are no substitute for careful human validation of topic models, which requires inspecting not only the top words associated with each topic, but also conducting more focused analyses of the documents themselves.
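For the quantitative side of that validation, the stm package can also compute diagnostics for a fitted model directly, such as the semantic coherence and exclusivity of each topic. A brief sketch, assuming the First_STM and out objects created above:
semanticCoherence(First_STM, documents = out$documents)   # higher (less negative) is better
exclusivity(First_STM)                                     # higher means more distinctive topics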
One of the principal advantages of STM is that one can examine the relationship between topics and various covariates of interest. Here we use the estimateEffect function to examine the relationship between the liberal/conservative rating variable and the first 10 topics, as well as time (day).
predict_topics <- estimateEffect(formula = 1:10 ~ rating + s(day), stmobj = First_STM,
                                 metadata = out$meta, uncertainty = "Global")
Once we have the model, we can plot the relationships. The code below picks three topics and plots them according to their association with the liberal/conservative rating variable.
plot(predict_topics, covariate = "rating", topics = c(3, 5, 9),
model = First_STM, method = "difference",
cov.value1 = "Liberal", cov.value2 = "Conservative",
xlab = "More Conservative ... More Liberal",
main = "Effect of Liberal vs. Conservative",
xlim = c(-.1, .1), labeltype = "custom",
custom.labels = c('Topic 3', 'Topic 5','Topic 9'))
We can also plot change in the prevalence of a topic over time. The code below plots change in the prevalence of topic 3 across 2008.
plot(predict_topics, "day", method = "continuous", topics = 3,
model = z, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"),
to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
axis(1,at = as.numeric(monthseq) - min(as.numeric(monthseq)),
labels = monthnames)
Topic models have become a standard tool within quantitative text analysis for many different reasons. Depending upon the use case, topic models can be much more useful than simple word frequency or dictionary-based approaches. Topic models tend to produce the best results when applied to texts that are not too short (tweets, for example, are often too brief) and that have a consistent structure.
At the same time, topic models have a number of important limitations. To begin, the term “topic” is somewhat ambiguous, and by now it is perhaps clear that topic models will not produce highly nuanced classifications of texts. Second, topic models can easily be abused if they are wrongly understood as an objective representation of the meaning of a text. Once again, these tools might more accurately be described as “tools for reading.” The results of topic models should not be over-interpreted unless the researcher has strong a priori theoretical expectations about the number of topics in a given corpus, or has carefully validated the results of the topic model using both the quantitative and qualitative techniques described above.