Text as Data

This is a repo for a class on Text as Data


Welcome to my course, “Text as Data.” This page provides an overview of the course, a description of each topic it covers, and instructions for accessing the software and materials you will need.

What is Text as Data?

The past decade has witnessed an explosion of data produced by websites such as Twitter, Facebook, Google, and Wikipedia, as well as the mass digitization of historical archives and administrative records. Though these new data sources hold enormous potential to address a range of pressing problems in industry and academia, collecting and analyzing text-based data presents unique challenges. Fortunately, the widespread availability of text-based data coincides with major advances in computer science and natural language processing. This course provides students with an overview of popular techniques for collecting, processing, and analyzing text-based data, including screen-scraping, mining data from application programming interfaces (APIs), topic modeling, text networks, and advanced text classifiers.

What Subjects are Covered in this Class?

This class covers a range of topics that build on one another. For example, in the first tutorial you will learn how to collect data from Twitter, and in subsequent tutorials you will learn how to analyze those data using automated text analysis techniques. For this reason, you may find it difficult to jump ahead to the more advanced topics before covering the basics.

Application Programming Interfaces

Screen-Scraping

Basic Text Analysis

Dictionary-Based Text Analysis

Topic Modeling

Text Networks

Word Embeddings

Who are You?

I am a Professor of Sociology, Public Policy, and Data Science at Duke University who studies political polarization on social media. You can learn more about my research here. Much of the material in the tutorials above draws upon my own research and text analysis techniques I’ve developed. I also draw heavily on excellent tutorials by a number of other people, whom I have tried to thank in each tutorial above. If I forgot to recognize your work, please email me!

How can I Access the Course Materials?

All of the materials for this course are available online via the links above. Many of the datasets used are loaded directly from my GitHub page, which also hosts all of the source files needed to produce the tutorials above.

How can I Get Started?

This course assumes basic familiarity with R. If you are new to R, I recommend the sequence of online courses described on this website to get you started.