Text as Data Course
Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail

Introduction

This is the first in a series of tutorials I’ve created about collecting data from web-based sources such as Twitter and analyzing them using various forms of automated text analysis. Before we proceed to the technical aspects of these techniques, I want to give you some sense of where they came from.

What is Digital Trace Data?

The past decade has witnessed a rapid increase in the amount of digital data produced on the internet that describe human behavior and other objects of scholarly inquiry. As the figure below shows, recent decades have seen growth not only in the amount of text-based data, but also in the computing power that is increasingly necessary to analyze them. Together, these two shifts hold the potential to significantly expand the scope of research in many different fields.

Strengths of Digital Trace Data

I will begin by discussing some of the positive aspects of digital trace data, and then move on to some of the challenges. In so doing, I draw upon Matt Salganik’s book Bit by Bit, which I highly recommend not only for a more detailed discussion of digital trace data, but also for its treatment of the nascent field of computational social science more broadly.

Always On

One of the most attractive features of digital trace data is that they are continuously collected, unlike surveys, which usually provide only a brief snapshot of the social world. As the image below indicates, social media can occasionally provide a glimpse of major events such as protests, revolutions, or stock market surges as they unfold.

Non-Reactive

Another important advantage of digital trace data is that they are non-reactive, or not produced via interaction between researchers and those whom they study. In some cases, this may lead to significant reductions in social desirability bias or other forms of interviewer effects. Consider, for example, the use of Google search data to study self-induced abortion (see figure below).

Captures Social Relationships

Digital trace data are also somewhat unusual in that they often describe social relationships. Whereas conventional survey techniques usually measure only the characteristics of individual subjects, digital trace data can often be used to measure social relationships, such as the network of European politicians pictured below.

Weaknesses of Digital Trace Data

Despite the considerable advantages of digital trace data described above, they also create a range of challenges for empirical observation and causal inference.

Incomplete

Though much is often made of the size and scale of digital trace data that can be collected, newcomers to the field are often surprised by how much of the data is missing or incomplete. Consider, for example, a study of bullying behavior on social media: many of the most abusive posts that might be of interest to a researcher are often removed by Facebook before one might attempt to study them.

Inaccessible

An even more formidable challenge is that data are often inaccessible. Though Twitter provides a massive amount of publicly available data, the vast majority of data generated on Facebook is private. Though some Facebook pages such as “fan pages” have default public settings, the vast majority of Facebook users set their default privacy settings in a manner that only allows people to access their data if they are affiliated with each other as “friends.”

Non-Representative

Another core challenge facing those who wish to work with digital trace data is that a random sample of Facebook or Twitter users is not representative of the broader population of the United States, or of most other countries. The figure below presents some data from the Wall Street Journal on the user demographics of several social media sites, which show significant differences by race across platforms. On the other hand, usage of Facebook has become so widespread that some readers might be surprised to see how much more representative it has become of the U.S. public in recent years.

Drift

MySpace was once the largest social media site in the world, according to some analysts. It now resides in the graveyard of internet history, like so many other websites. This raises the risk of “drift” in digital trace data: platforms shift not only in their overall popularity (which of course has important implications for their representativeness), but also in who uses them and why. Though Facebook was once the most popular platform for U.S. undergraduates, many have shifted to Instagram or Snapchat, possibly in reaction to the uptick of Facebook usage by their parents’ generation :)

Algorithmic Confounding

Sometimes, digital trace data that appear to describe human behavior actually reflect changes in the way humans interact with algorithms. One popular example of this is the “parable of Google Flu.” Google Flu was once a popular tool that allowed users to estimate the prevalence of influenza using Google search data. The tool was so accurate that some suggested it should displace official surveys from the Centers for Disease Control (CDC). Yet in early 2013, Google estimates were far higher than those from the CDC. Researchers later discovered that the estimates of influenza had been inflated by Google advertising links about the flu. These links appeared in people’s web browsers after they searched for information about symptoms of the common cold, and people were clicking on them. This is sometimes referred to as “blue-team” dynamics.

Unstructured

Digital trace data are also often very messy. Newcomers to the field often assume that because data are generated in digital format, they are well structured, easily searchable, and quickly transposable across different formats. As we shall see in future tutorials, this is most often not true: a recent New York Times article indicated that data scientists spend upwards of 80% of their time cleaning data!

Sensitive

Digital trace data are also often very sensitive. Recent events involving Facebook and the political consulting firm Cambridge Analytica underscore the dangers of unfettered access to large troves of digital trace data, but there were many more (arguably more invasive) data breaches long before this recent event. One such incident, pictured below, involved European researchers who mined data from the internet dating site OkCupid and then publicly released their data online.

Positivity Bias

Finally, digital trace data often have performative dimensions. Many people do not report negative information about themselves online precisely because they know that their friends, colleagues, or other people they do not know may be watching them. This creates another common form of bias in social media research.

Exploring Text-Based Datasets

Before we proceed to analyze the data we collected in previous tutorials from Facebook and Twitter, it is worth noting that there are numerous high-quality text datasets that are publicly available on the internet. These include very large corpora from the New York Times, Google, Wikipedia, Reuters, and many other sites. I curate a crowd-sourced list of promising datasets here.

Now YOU Try It

Pair up with your neighbor.
Pick a dataset from the list in the previous section, or another one that you are hoping to analyze after this course.
Identify at least three strengths and weaknesses of the dataset, drawing upon this introduction or other sources.

The Future of Digital Trace Data

After reviewing so many of the negatives of digital trace data, you may be questioning whether its strengths outweigh its weaknesses. I am overall optimistic about the future of digital trace data research because it is in its infancy. As Salganik (2016) writes, the field has recently experienced a Gartner hype cycle (see figure below). In my opinion, reaching the “plateau of productivity” will most likely require hybrid approaches that combine digital trace data analysis with more conventional modes of research such as survey analysis. I’ve written at length about this issue elsewhere, and specifically about the potential of using app technology to integrate digital trace data collection with survey research.