Email: christopher.bail@duke.edu
Office Location: 254 Soc/Psych Hall
Office Hours: Tuesdays, 12 - 1:15 pm, Soc/Psych 254
Website: https://www.chrisbail.net
Github: https://github.com/cbail
Teaching Assistants: Kate Coulter (by appointment) and Joe Littell (Thursdays, 9 - 10 am, Gross 230K)
Slack Channel: https://dukemids.slack.com/archives/CRSPSGFHR
Dropbox (to submit Labs): https://sakai.duke.edu/portal/directtool/4049fefe-3ba1-43e8-9cba-a92758c96655/
The worldwide spread of COVID-19 has impacted all of us. Many of us are dealing with unprecedented challenges that have stretched us to our limits. I am working hard to adapt our class to these challenging circumstances but will need to ask for your patience as I continue to assess how to best serve everyone’s needs in a rapidly changing environment. Below is my first effort to outline a path forward for us, but I am very open to feedback from you and may make additional changes in accordance with your needs or requests.
Remote Learning
Thank you all for completing the anonymous survey I distributed about your current situation. This has been enormously helpful as I try to chart the best path forward for our class. Unfortunately, the survey revealed that some people in our class do not have regular access to high-quality internet right now, which means at least some people will not be able to participate in classes in real time. To accommodate these people, I am going to distribute links to recordings of each Zoom session in our course’s Slack channel for the remainder of the session. We will hold class at our regularly scheduled times. For those of you who are not able to join live, I will be very happy to answer any questions you have on Slack at a later point in time.
Course Requirements
As you may have heard by now, Duke University has decided to make all classes graded as “Satisfactory” or “Unsatisfactory” unless you request a letter grade. You will receive a “Satisfactory” grade as long as you achieve more than a 70 (or C-) in our class. In general, I hope this decision will give you the flexibility you need to achieve at a high level in our class. Our move to remote learning does pose significant challenges to the main course requirement in our class: our group projects. When we meet on Tuesday, I am going to discuss a path forward for this course requirement, which will likely involve reassessing what is possible via remote interaction. I am also hoping that we can find some positive energy, however, and perhaps do some work together that might help the societal response to COVID-19. For now, we will proceed with our original grading plan for the group project (see grade calculations below). We will also continue with weekly labs, but I am going to make these optional for the rest of the semester. If you do not complete any additional labs, your lab grade for the semester will be the average grade you have received on the labs before spring break.
Other Forms of Support
If you require support, whether it is health care, mental health care, or financial support, Duke has created a single website where updates are being posted about the university’s response and other resources that are available to you as a member of the Duke community. I urge you to visit https://coronavirus.duke.edu/ regularly to take advantage of the information posted there each day.
Welcome to “Data Scraping and Text Analysis” (IDS 703), one of the core courses in Duke University’s Masters Program in Interdisciplinary Data Science (MIDS). The past decade has witnessed an explosion of data produced by websites such as Twitter, Facebook, Google, and Wikipedia, but also the mass digitization of historical archives and administrative records. Though these new data sources hold enormous potential to address a range of pressing problems within industry and academia, collecting and analyzing text-based data presents unique challenges. Fortunately, the widespread availability of text-based data coincides with major advances in the fields of computer science and natural language processing. This course will provide students with an overview of popular techniques for collecting, processing, and analyzing text-based data—including screen-scraping, mining data from application programming interfaces or APIs, topic modeling, text networks, and advanced text classifiers. We will also discuss the challenge of conducting empirical research with these data, including ethics, causal inference, and the external validity of digital sources.
R will be the main programming language for this course. Though the class will review basic programming techniques such as loops and functions, practical experience with R is highly recommended. The majority of our time will be spent mastering the following R packages: rvest, rtweet, lda, stm, LDAvis, textnets, and wordVectors, as well as a variety of functions in base R. General knowledge of data structures, basic programming, and rudimentary statistics is also required.
All other readings are linked under course material as indicated in the course schedule below.
Success in this course requires that you attend class, complete requisite readings before class, complete weekly homework assignments, complete a mid-term exam, and develop a final group project with several of your classmates. I describe each of these requirements in the following sections.
You must complete all readings prior to class and come to our meetings prepared to discuss them.
Weekly homework exercises or “labs” are linked in the course schedule below and are designed to gauge your knowledge of the material presented during that week. Each lab is designed to correspond with a major course topic, which will be indicated in the lab title. Each lab will be posted at least one week prior to its due date so that it is available for students at the start of the first lecture covering that topic and is due via Dropbox after the second lecture on the topic, prior to the next scheduled lecture’s start. For example, Lab 1 which covers material presented on APIs (January 21 and 23) will be posted by the start of class, January 21, and will be due via Dropbox prior to January 28’s lecture.
The mid-term exam will include approximately 20 multiple choice and short answer questions that cover content from the first half of the class.
The final project will be a team-based research project that fuses all of the skills we learn during the course of the semester in order to answer a question that can be solved with text-based data. All projects must analyze at least 1,000 documents and apply some type of automated text analysis to identify meaningful patterns within them and address an empirical question of interest to data scientists, broadly defined. Examples might include a sentiment analysis of a company’s tweets and responses to those tweets in order to measure customer satisfaction, or an analysis that seeks to determine why posts on Facebook fan pages go viral.
Each team must produce a 20-minute presentation that describes their research project. The presentation may be in PowerPoint or another medium, will be given during a final class period, and must be sent to the instructor beforehand. Each team must also submit a report that details their work and what each student contributed to the project (i.e., which team members contributed to which part of the project and precisely what work each team member did). Your goal should be for all members of the team to contribute to each stage of the project. As an additional accountability mechanism, ⅓ of your grade for the final project will be determined by your fellow group members, with the remaining ⅔ determined by Professor Bail (this part of the grade will be assigned to the entire team rather than to individual members).
Attendance in this course is mandatory and you are expected to be an active participant in all classroom discussions and exercises. If you suffer from social anxiety or if English is not your first language, I encourage you to participate in discussions outside of class on our Slack Channel. Uncivil behavior such as engaging in personal conversations during lectures or discussion sections, browsing internet sites not relevant to classroom discussions, and cell phone usage will negatively affect your grade. Your participation grade will be calculated at the end of the semester, but if you would like to receive input on your participation grade at any point during the semester, please contact me.
Labs: 30%
Mid-Term Exam: 30%
Final Project: 30% (⅓ of this grade is determined by your fellow group members and ⅔ is determined by Professor Bail)
Participation: 10%
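To make the weighting concrete, here is a minimal sketch of the arithmetic. It is written in Python purely for illustration (our course itself uses R). The function names and the example scores are hypothetical, and it assumes every component is scored on a 0-100 scale.

```python
# Weights taken from the grade breakdown above, as fractions of the final grade.
WEIGHTS = {
    "labs": 0.30,
    "midterm": 0.30,
    "final_project": 0.30,
    "participation": 0.10,
}


def final_project_score(peer_score, instructor_score):
    """Combine the two parts of the project grade: 1/3 peers, 2/3 instructor."""
    return peer_score / 3 + 2 * instructor_score / 3


def course_grade(labs, midterm, peer_score, instructor_score, participation):
    """Weighted average of all four components (assumed to be on a 0-100 scale)."""
    components = {
        "labs": labs,
        "midterm": midterm,
        "final_project": final_project_score(peer_score, instructor_score),
        "participation": participation,
    }
    return sum(WEIGHTS[name] * score for name, score in components.items())


if __name__ == "__main__":
    # Hypothetical student: 90 on labs, 80 on the mid-term, 90 from group
    # members, 84 from the instructor, and full participation credit.
    print(round(course_grade(90, 80, 90, 84, 100), 1))
```

For this made-up student, the project score works out to 86 (30 points from group members plus 56 from the instructor), and the four weighted components add up to a course grade of about 86.8.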
I am a very reasonable person and understand that sometimes “life happens” and you may not be able to complete your work in a timely manner. I will therefore carefully consider all explanations for deviations from the general course policies outlined below.
Attendance for all lectures is mandatory. If you have a University-excused reason to miss class, please follow the proper procedure for letting your TA know about it. You are responsible for knowing and acting in accordance with University policy.
Understand and follow the Duke Community Standard. Plagiarism, cheating or other violations will be dealt with according to University policy. All student assignments will be processed by plagiarism detection software.
If you are suffering from stress, depression, anxiety, or any other mental health issues that are common among Duke students, please consider visiting Counseling and Psychological Services (CAPS) on campus, who provide support on everything from minor to major mental health issues.
Duke Reach serves to direct students and faculty alike to resources that can help them in times of need. These include the Student Behavioral Assessment team, the Duke Wellness Center, and many others.
There will be no extra credit or make-up assignments.
I reserve the right to make changes to the syllabus, including project due dates and test dates. These changes will be announced as early as possible and no later than one week before materials are due.
Creating high quality teaching materials is hard work! If you ever discover any errors or inconsistencies in the teaching materials on this site, please email one of the teaching assistants and cc me.
Below I have listed several resources which I hope might be helpful to you for this course and beyond (particularly if you want to pursue the study of text as data after this class).
In this class, we will use the R software, which is free and open-source. There are a variety of different ways to use R, but the most common way to do so is with the software RStudio, a free Graphical User Interface which you can either run on your laptop, or via a web server. R and RStudio are both supported by a vibrant community of individuals who have created a treasure-trove of learning resources online. Here is a link to some very helpful beginner tutorials, and this link also includes some intermediate and advanced tutorials if you really want to challenge yourself.
The field of computational social science is growing so rapidly that none of the resources I give you will remain at the cutting edge for long. You will almost certainly encounter issues unique to the data we collect as part of our group research project, as well as incompatibilities between software packages or with your computer. Stack Overflow is a website where computer programmers help each other solve such problems. Individuals ask questions, and others earn “reputation points” for solving their problems. These reputation points are awarded by the person who asks the question as well as by other site users who vote on the elegance and efficiency of each solution. For you, this reputation system means you can quickly identify the highest-quality solutions to your problems.
Many of the most important advances in computational social science appear first on Twitter or blogs. I therefore encourage you to open a Twitter account, if you don’t already have one, and follow the authors we read, or consider checking out the people I follow. Having a Twitter account will also come in handy for some of the exercises we do in class to collect data from Twitter. Of the many blogs that you might read, I recommend R Bloggers, which provides a concise overview of new functions in R along with solutions to common problems faced by computational social scientists and those in other fields.
We meet every Tuesday and Thursday unless otherwise noted. Readings must be completed before each class where they appear on the schedule below. Lab assignments are linked below immediately prior to the date they are open (note that each lab’s due date is listed below as well as on the lab page). Links to my presentation slides and annotated code relevant to each lecture are below as well.
Topics:
Introductions and Housekeeping
Materials:
None
Required reading:
- Salganik, Matthew, Bit by Bit, Introduction & Observing Behavior
Suggested reading:
- David Donoho. 50 Years of Data Science
- Lazer et al. Computational Social Science, Science.
- Lazer et al. Life in the network: the coming age of computational social science, Science.
- Watts, Duncan. Should social science be more solution-oriented?, Nature
Topics:
Strengths and Weaknesses of Text as Data
Required reading:
- Salganik, Matthew, Bit by Bit, Asking Questions.
- Mullainathan, Sendhil. Biased Algorithms Are Easier to Fix Than Biased People, New York Times
Suggested reading:
- Blumenstock et al. Predicting Poverty and Wealth from Mobile Phone Data, Science.
Required reading:
- Salganik, Matthew. Bit by Bit, Ethics.
Suggested reading:
- Robinson Meyer. Everything We Know About Facebook’s Secret Mood Manipulation Experiment, the Atlantic.
Materials:
Slides
- Due: January 28
Topics:
What is an API? Credentials, and Rate Limiting
Materials:
Slides, Annotated Code
Required reading:
- Adam Kramer, Jamie Guillory, & Jeffrey Hancock. Emotional Contagion, PNAS.
Topics:
Working with the Twitter API
Materials:
Slides, Annotated Code
- Due: February 4
Topics:
What is screen-scraping, character encoding
Materials:
Slides, Annotated Code
Topics:
Reading and parsing HTML
Materials:
Slides, Annotated Code
- Due: February 13
Topics:
GREP, Tokenization, Stemming
Materials:
Slides, Annotated Code
Required reading:
- James Evans & Pedro Aceves. Machine Translation: Mining Text for Social Theory. Annual Review of Sociology.
Suggested reading:
- Justin Grimmer & Brandon Stewart. Text as Data: The Promises and Pitfalls of Automated Content Analysis, Political Analysis.
- Bo Pang, Lillian Lee, & Shivakumar Vaithyanathan. Thumbs up: Sentiment Classification using Machine Learning Techniques.
- Kathleen Carley. Extracting Culture Through Textual Analysis. Poetics, 22:291-312.
Topics:
Text pre-processing, n-grams
Materials:
Slides, Annotated Code
- Due: February 28
Materials:
Slides, Annotated Code
Required reading:
- Kramer et al. 2014. Experimental evidence of massive-scale emotional contagion through social networks Proceedings of the National Academy of Sciences
- Due: March 6
Materials:
Slides, Annotated Code
Required reading:
- Blei, David M. 2012. Probabilistic Topic Models. Communications of the ACM (Note: this is a challenging article, so don’t worry if you are not able to understand every last part).
Materials:
Slides, Annotated Code, Running R on AWS
Materials:
Slides, Annotated Code
Materials:
Slides, Annotated Code
Materials:
Slides, Annotated Code
Materials:
Slides on Chat Bots
- Groups 1-5, all individual projects not listed below
- Groups 6-10, individual projects: Abdur, Amandeep