Contact Information

Email:
Office Location: 254 Soc/Psych Hall
Office Hours: Tuesdays, 12 - 1:15 pm, Soc/Psych 254
Website: https://www.chrisbail.net
Github: https://github.com/cbail
Teaching Assistants: Kate Coulter / By Appointment &
Joe Littell / Thursdays, 9 - 10 am, Gross 230K
Slack Channel: https://dukemids.slack.com/archives/CRSPSGFHR
Dropbox (to submit Labs): https://sakai.duke.edu/portal/directtool/4049fefe-3ba1-43e8-9cba-a92758c96655/

COVID-19 ANNOUNCEMENT

The worldwide spread of COVID-19 has impacted all of us. Many of us are dealing with unprecedented challenges that have stretched us to our limits. I am working hard to adapt our class to these challenging circumstances but will need to ask for your patience as I continue to assess how to best serve everyone’s needs in a rapidly changing environment. Below is my first effort to outline a path forward for us, but I am very open to feedback from you and may make additional changes in accordance with your needs or requests.

Remote Learning

Thank you all for completing the anonymous survey I distributed about your current situation. This has been enormously helpful as I try to chart the best path forward for our class. Unfortunately, the survey revealed that some people in our class do not have regular access to high-quality internet right now, which means at least some people will not be able to participate in classes in real time. To accommodate them, I am going to distribute links to recordings of each Zoom session in our course’s Slack channel for the remainder of the semester. We will hold class at our regularly scheduled times. For those of you who are not able to join live, I will be very happy to answer any questions you have by Slack at a later time.

Course Requirements

As you may have heard by now, Duke University has decided to make all classes graded as “Satisfactory” or “Unsatisfactory” unless you request a letter grade. You will receive a “Satisfactory” grade as long as you achieve more than a 70 (or C-) in our class. In general, I hope this decision will give you the flexibility you need to achieve at a high level in our class. Our move to remote learning does pose significant challenges to the main course requirement in our class: our group projects. When we meet on Tuesday, I am going to discuss a path forward for this course requirement, which will likely involve reassessing what is possible via remote interaction. I am also hoping, however, that we can find some positive energy and perhaps do some work together that might help the societal response to COVID-19. For now, we will proceed with our original grading plan for the group project (see grade calculations below). We will also continue with weekly labs, but I am going to make these optional for the rest of the semester. If you do not complete any additional labs, your lab grade for the semester will be the average grade you received on the labs before spring break.

Other Forms of Support

If you require support, whether it is health care, mental health care, or financial support, Duke has created a single website where updates are being posted about the university’s response and other resources that are available to you as a member of the Duke community. I urge you to visit https://coronavirus.duke.edu/ regularly to take advantage of the information posted there each day.

Course Description

Welcome to “Data Scraping and Text Analysis” (IDS 703), one of the core courses in Duke University’s Masters Program in Interdisciplinary Data Science (MIDS). The past decade has witnessed an explosion of data produced by websites such as Twitter, Facebook, Google, and Wikipedia, but also the mass digitization of historical archives and administrative records. Though these new data sources hold enormous potential to address a range of pressing problems within industry and academia, collecting and analyzing text-based data presents unique challenges. Fortunately, the widespread availability of text-based data coincides with major advances in the fields of computer science and natural language processing. This course will provide students with an overview of popular techniques for collecting, processing, and analyzing text-based data—including screen-scraping, mining data from application programming interfaces or APIs, topic modeling, text networks, and advanced text classifiers. We will also discuss the challenge of conducting empirical research with these data, including ethics, causal inference, and the external validity of digital sources.

Prerequisites

R will be the main programming language for this course. Though the class will review basic programming techniques such as loops and functions, practical experience with R is highly recommended. The majority of our time will be spent mastering the following R packages: rvest, rtweet, lda, stm, ldaviz, textnets, and wordVectors, as well as a variety of functions in base R. General knowledge of data structures, basic programming, and rudimentary statistics is also required.

Requirements

Success in this course requires that you attend class, complete requisite readings before class, complete weekly homework assignments, complete a mid-term exam, and develop a final group project with several of your classmates. I describe each of these requirements in the following sections.

Readings

You must complete all readings prior to class and come to our meetings prepared to discuss them.

Labs

Weekly homework exercises or “labs” are linked in the course schedule below and are designed to gauge your knowledge of the material presented during that week. Each lab corresponds to a major course topic, which will be indicated in the lab title. Each lab will be posted at least one week prior to its due date, so that it is available at the start of the first lecture covering that topic, and is due via Dropbox after the second lecture on the topic, prior to the start of the next scheduled lecture. For example, Lab 1, which covers material presented on APIs (January 21 and 23), will be posted by the start of class on January 21 and will be due via Dropbox prior to the lecture on January 28.

Mid-Term Exam

The mid-term exam will include approximately 20 multiple choice and short answer questions that cover content from the first half of the class.

Final Project

The final project will be a team-based research project that fuses all of the skills we learn during the course of the semester in order to answer a question that can be solved with text-based data. All projects must analyze at least 1,000 documents and apply some type of automated text analysis to identify meaningful patterns within them and address an empirical question of interest to data scientists, broadly defined. Examples might include a sentiment analysis of a company’s tweets and responses to those tweets in order to measure customer satisfaction, or an analysis that seeks to determine why posts on Facebook fan pages go viral.

Each team must produce a 20-minute presentation that describes its research project. The presentation will be given in PowerPoint or another medium during a final class period, and must be delivered to the instructor before that time. Each team must also submit a report that details its work and what each student contributed to the project (i.e., which team members contributed to which part of the project and precisely what work each team member did). Your goal should be for all members of the team to contribute to each stage of the project. As an additional accountability mechanism, ⅓ of your grade for the final project will be determined by your fellow group members, with the remaining ⅔ determined by Professor Bail (this part of the grade will be assigned to the entire team rather than to individual members).

Participation

Attendance in this course is mandatory and you are expected to be an active participant in all classroom discussions and exercises. If you suffer from social anxiety or if English is not your first language, I encourage you to participate in discussions outside of class on our Slack channel. Uncivil behavior such as engaging in personal conversations during lectures or discussion sections, browsing internet sites not relevant to classroom discussions, and cell phone usage will negatively affect your grade. Your participation grade will be calculated at the end of the semester, but if you would like to receive input on your participation grade at any point during the semester, please contact me.

Grading Scheme

Labs: 30%

Mid-Term Exam: 30%

Final Project: 30% (⅓ of this grade is determined by your fellow group members and ⅔ is determined by Professor Bail)

Participation: 10%
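
As a rough illustration of how the weights above combine, here is a minimal sketch in Python. It assumes every component is scored on a 0-100 scale; the function names and the sample scores in the usage note are hypothetical, and official grades are calculated by the instructor, not by this snippet.

```python
# Weights taken from the Grading Scheme above.
WEIGHTS = {
    "labs": 0.30,
    "midterm": 0.30,
    "final_project": 0.30,
    "participation": 0.10,
}


def final_project_score(peer_score, instructor_score):
    """Combine the peer (1/3) and instructor (2/3) parts of the project grade."""
    return peer_score / 3 + 2 * instructor_score / 3


def course_score(labs, midterm, peer_score, instructor_score, participation):
    """Weighted course score, assuming every input is on a 0-100 scale."""
    components = {
        "labs": labs,
        "midterm": midterm,
        "final_project": final_project_score(peer_score, instructor_score),
        "participation": participation,
    }
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def is_satisfactory(score):
    """'Satisfactory' requires more than a 70, per the COVID-19 announcement."""
    return score > 70
```

For example, with hypothetical scores of 90 on labs, 80 on the mid-term, 90 from group members, 84 from the instructor, and 100 for participation, the project component is 86 and the weighted course score is about 86.8, which is above the “Satisfactory” threshold.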

General Course Policies

I am a very reasonable person and understand that sometimes “life happens” and you may not be able to complete your work in a timely manner. I will therefore carefully consider all explanations for deviation from the general course policies outlined below.

Attendance

Attendance for all lectures is mandatory. If you have a University-excused reason to miss class, please follow the proper procedure for letting your TA know about it. You are responsible for knowing and acting in accordance with University policy.

Academic Integrity

Understand and follow the Duke Community Standard. Plagiarism, cheating or other violations will be dealt with according to University policy. All student assignments will be processed by plagiarism detection software.

Mental Health and Stress

If you are suffering from stress, depression, anxiety, or any other mental health issues that are common among Duke students, please consider visiting Counseling and Psychological Services (CAPS) on campus, which provides support on everything from minor to major mental health issues.

Duke Reach directs students and faculty alike to resources that can help them in times of need. These resources include the Student Behavioral Assessment team, the Duke Wellness Center, and many more.

Extra-Credit Policy

There will be no extra credit or make-up assignments.

Syllabus

I reserve the right to make changes to the syllabus, including project due dates and test dates. These changes will be announced as early as possible and no later than one week before materials are due.

Help Us Make This Course Better

Creating high-quality teaching materials is hard work! If you ever discover any errors or inconsistencies in the teaching materials on this site, please email one of the teaching assistants and cc me.

Resources

Below I have listed several resources which I hope might be helpful to you for this course and beyond (particularly if you want to pursue the study of text as data after this class).

RStudio Tutorials

In this class, we will use the R software, which is free and open-source. There are a variety of different ways to use R, but the most common way to do so is with the software RStudio, a free Graphical User Interface which you can either run on your laptop, or via a web server. R and RStudio are both supported by a vibrant community of individuals who have created a treasure-trove of learning resources online. Here is a link to some very helpful beginner tutorials, and this link also includes some intermediate and advanced tutorials if you really want to challenge yourself.

The Summer Institutes in Computational Social Science (SICSS)

I am the co-founder of the Summer Institutes in Computational Social Science. These annual events, held in dozens of places around the globe, are designed to introduce PhD students and young faculty members to the field. Though master's-level students are not currently invited to attend the events, my co-founder and I have created an extensive website that includes links to videos of lectures that he and I give on a range of topics, as well as talks by some of the most renowned scholars in the field, employees of large companies interested in the field such as Facebook, and others who work in non-profits or government. If you find this class interesting or exciting, you may wish to check out some of the videos from these speakers to get a sense of the full array of work going on in our field.

Stack Overflow

The field of computational social science is growing so rapidly that none of the resources I give you will remain at the cutting edge for long. You will almost certainly encounter issues unique to the data we collect as part of our group research project, as well as incompatibilities between software packages and your computer. Stack Overflow is a website where computer programmers help each other solve such problems. Individuals ask questions, and others earn “reputation points” for solving their problems. These reputation points are awarded by the person who asks the question as well as by other site users who vote on the elegance and efficiency of each solution. For you, this reputation system means you can quickly identify the highest-quality solutions to your problems.

Twitter/Blogs

Many of the most important advances in computational social science appear first on Twitter or blogs. I therefore encourage you to open a Twitter account (if you don’t already have one) and follow the authors we read, or consider checking out the people I follow. Having a Twitter account will also come in handy for some of the exercises we do in class to collect data from Twitter. Of the many blogs that you might read, I recommend R Bloggers, which provides a concise overview of new functions in R along with solutions to common problems faced by computational social scientists and those in other fields.

Course Schedule (Spring 2020)

We meet every Tuesday and Thursday unless otherwise noted. Readings must be completed before each class where they appear on the schedule below. Lab assignments are linked below immediately prior to the date they are open (note that each lab’s due date is listed below as well as on the lab page). Links to my presentation slides and annotated code relevant to each lecture are below as well.

Introduction

January 9: Preliminaries

Topics:
Introductions and Housekeeping

Materials:
None

Required reading:
- Salganik, Matthew, Bit by Bit, Introduction & Observing Behavior

Suggested reading:
- David Donoho. 50 Years of Data Science
- Lazer et al. Computational Social Science, Science.
- Lazer et al. Life in the network: the coming age of computational social science, Science.
- Watts, Duncan. Should social science be more solution-oriented?, Nature

January 14: Introduction to Text as Data

Topics:
Strengths and Weaknesses of Text as Data

Materials:
Slides, Document

Required reading:
- Salganik, Matthew, Bit by Bit, Asking Questions.
- Mullainathan, Sendhil. Biased Algorithms Are Easier to Fix Than Biased People, New York Times

Suggested reading:
- Blumenstock et al. Predicting Poverty and Wealth from Mobile Phone Data, Science.

January 16: Ethics in Text as Data

Required reading:
- Salganik, Matthew. Bit by Bit, Ethics.

Suggested reading:
- Robinson Meyer. Everything We Know About Facebook’s Secret Mood Manipulation Experiment, the Atlantic.

Materials:
Slides

Assignment: Lab 1

- Due: January 28

January 21: Application Programming Interfaces, Part 1

Topics:
What is an API? Credentials, and Rate Limiting

Materials:
Slides, Annotated Code

Required reading:
- Adam Kramer, Jamie Guillory, & Jeffrey Hancock. Emotional Contagion, PNAS.

January 23: Application Programming Interfaces, Part 2

Topics:
Working with the Twitter API

Materials:
Slides, Annotated Code

Assignment: Lab 2

- Due: February 4

January 28: Screen-Scraping, Part 1

Topics:
What is screen-scraping, character encoding

Materials:
Slides, Annotated Code

January 30: Screen-Scraping, Part 2

Topics:
Reading and parsing HTML

Materials:
Slides, Annotated Code

Assignment: Lab 3

- Due: February 13

February 4: Basic Text Analysis, Part 1

Topics:
GREP, Tokenization, Stemming

Materials:
Slides, Annotated Code

Required reading:
- James Evans & Pedro Aceves. Machine Translation: Mining Text for Social Theory. Annual Review of Sociology.

Suggested reading:
- Justin Grimmer & Brandon Stewart. Text as Data: The Promises and Pitfalls of Automated Content Analysis, Political Analysis.
- Bo Pang, Lillian Lee, & Shivakumar Vaithyanathan. Thumbs up: Sentiment Classification using Machine Learning Techniques.
- Kathleen Carley. Extracting Culture Through Textual Analysis. Poetics, 22:291-312.

February 6: No Class

February 11: Basic Text Analysis, Part 2

Topics:
Text pre-processing, n-grams

Materials:
Slides, Annotated Code

Mid-Term Exam & Review

February 13: Mid-Term Review

February 18: Mid-Term Exam

Assignment: Lab 4

- Due: February 28

February 20: Dictionary-Based Analysis

Materials:
Slides, Annotated Code

Required reading:
- Kramer et al. 2014. Experimental evidence of massive-scale emotional contagion through social networks Proceedings of the National Academy of Sciences

Assignment: Lab 5

- Due: March 6

February 25: Topic Modeling

Materials:
Slides, Annotated Code

Required reading:
- Blei, David M. 2012. Probabilistic Topic Models. Communications of the ACM (Note: this is a challenging article, so don’t worry if you are not able to understand every last part).

February 27: Structural Topic Models

Materials:
Slides, Annotated Code, Running R on AWS

March 3: Social Network Analysis, Part I

Topics:
An Introduction to Social Network Analysis

Materials:
Slides, Annotated Code

Required reading:
- Nicholas Christakis & James Fowler. Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives, Chapter One.

Additional reading:
- Duncan Watts. 1999. Small Worlds, Chapter 1. Princeton University Press: 3-8.
- Kieran Healy. (2013). Using Metadata to Find Paul Revere, Blog post.
- David Austin. (2006). How Google Finds Your Needle in the Web’s Haystack, American Mathematical Society Feature Column.

March 5: Social Network Analysis, Part II

Topics:

Materials:
Slides, Annotated Code

Required reading:
- Nicholas Christakis & James Fowler. Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives, Chapter One.

Additional reading:
- Duncan Watts. 1999. Small Worlds, Chapter 1. Princeton University Press: 3-8.
- Kieran Healy. (2013). Using Metadata to Find Paul Revere, Blog post.
- David Austin. (2006). How Google Finds Your Needle in the Web’s Haystack, American Mathematical Society Feature Column.

Spring Break

March 10: No Class (Spring Break)

March 12: No Class (Spring Break)

March 17: Extended Spring Break due to Coronavirus

March 19: Extended Spring Break due to Coronavirus

March 24: Text Networks, Part 1

Materials:
Slides, Annotated Code

March 26: Text Networks, Part 2

Topics:
Textnets

Materials:
Slides, Annotated Code

March 31: Word Embeddings, Part 1

Materials:
Slides, Annotated Code

April 2: Word Embeddings, Part 2

Materials:
Slides, Annotated Code

April 7: Chat Bots

Materials:
Slides on Chat Bots

April 9: Final Project Presentations

- Groups 1-5, all individual projects not listed below

April 14: Final Project Presentations

- Groups 6-10, individual projects: Abdur, Amandeep