Chris Bail
Duke University
www.chrisbail.net

Introduction

This is part of a series of tutorials I’ve written on collecting digital trace data from sites such as Twitter, Facebook, and other internet sources. These earlier tutorials demonstrated some of the potential of digital trace data, but also highlighted many limitations of these new wellsprings of data: digital trace data are often incomplete, inaccessible, non-representative, unstructured (and thus difficult to work with), and sensitive in nature. Because of these limitations, there is growing consensus that hybrid approaches are needed that combine digital trace data with more conventional methods such as surveys.

Note: this tutorial is a work in progress. It will be updated soon to include more annotated code.

Building Apps for Social Science Research

The term “app” has come to refer to an impressive array of software, from the mobile tools we use to find our way around to desktop-based tools for editing photos. In this article, I make the case that apps can also be extremely useful for social scientists. More specifically, I argue that apps can provide a vehicle for social scientists to collect digital trace data alongside survey data. Such a hybrid strategy has numerous advantages. For example, surveys can be used to collect demographic information about those who produce social media texts in order to evaluate issues of coverage and representativeness, or to collect information about other confounding factors. What is more, authentication dialogues within apps provide a natural opportunity to obtain informed consent from social media users, though this was not done in a comprehensive manner in past studies such as those that became central to the Cambridge Analytica scandal.

To demonstrate the potential of apps for social science research, it will be useful to provide a more detailed example. Some years ago, I was interested in how social media posts go viral. I was specifically interested in public posts on Facebook fan pages by advocacy organizations and other civil society groups. Using the procedures described in my previous tutorials, I could easily collect the text of all messages and counts of the number of times they were shared or commented upon via Facebook’s Graph API. Yet I was not able to answer important questions about who had viewed such posts, and precisely how they interacted with them. I was also not able to measure critical features of the non-profit groups that I was studying, such as their financial resources, number of staff, and use of offline tactics to call attention to their cause.

To solve this problem, I created a web-based app called “Find Your People.” Find Your People allowed non-profit organizations to get high-quality analysis of their social media outreach via comparisons to their peers who had also installed the app. In return, organizations agreed to share with me non-public aggregate data about their audiences known as Facebook “insights data.” These include metrics such as the number of people age 18-24 who viewed a post on a given day, but not the names of those people or other information that could be used to easily identify them. In addition, Find Your People asked non-profit organizations to complete a brief web-based survey that allowed me to collect additional information about each organization. I recruited non-profit groups working in the fields of Autism Spectrum Disorders and Human Organ Donation to install the app. Response rates to these requests were relatively high, and I identified minimal evidence of response bias. I used this tool to develop a new theory of how social media posts go viral. Readers who are interested in this theory, or those who would like to see more detail about how the app was employed, can view this paper, this paper, this paper, or this paper.

How to Build Apps

When I created the Find Your People app, app-building required somewhat involved knowledge of programming in multiple languages, web design, and cloud computing. Yet the R package Shiny has been a game-changer. Shiny is an interactive app-building tool that you can use directly from RStudio. In addition to an easy-to-use, integrated app-building tool, RStudio also provides a variety of tools to host and deploy apps on the web with the click of a button. Finally, there is a vibrant community of Shiny app developers, many of whom share the code they used to create their apps on sites such as this one.

There are a number of excellent tutorials online about how to use Shiny, including this excellent video series. Many of these cover simple tools for interactive data visualization, yet Shiny enables development of apps for just about anything. Indeed, API calls can be embedded within Shiny apps to produce analyses of a user’s Twitter data; consider, for example, this nice example. Shiny also allows you to create text boxes, multiple-choice buttons, and many of the other standard elements of online surveys, as the sketch below illustrates. Together, these tools could be used to recreate the functionality that I developed in the stone age with much less time and energy.
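To give a sense of how little code this requires, here is a minimal sketch of a Shiny survey page, assuming only the shiny package; the input names and question wording are hypothetical placeholders rather than the actual Find Your People survey.

library(shiny)

ui <- fluidPage(
  titlePanel("Example Organization Survey"),
  #A free-text survey question
  textInput("org_name", "What is the name of your organization?"),
  #A multiple-choice survey question
  radioButtons("staff_size", "How many paid staff does your organization have?",
               choices = c("0", "1-5", "6-20", "More than 20")),
  actionButton("submit", "Submit"),
  textOutput("confirmation")
)

server <- function(input, output) {
  #When the user clicks "Submit," confirm receipt of the responses
  #(a real app would also write them to persistent storage here)
  observeEvent(input$submit, {
    output$confirmation <- renderText(
      paste0("Thank you, ", input$org_name, "! Your responses were recorded.")
    )
  })
}

shinyApp(ui = ui, server = server)

An app like this can be deployed to the web with RStudio’s hosting tools, so respondents simply visit a link in their browser.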

Building Bots for Social Science Research

Another recent trend in studies that employ digital trace data is the creation of bots, or automated social media accounts. In a very creative study, political science PhD student Kevin Munger built bots designed to examine racial harassment on Twitter. He created two automated accounts, one of which had a profile picture of a white person and the other of an African-American person. The bots were designed to a) search Twitter for tweets by white men that contain racist language; and then b) reply to these tweets with condemnations of the racist language. Notwithstanding some limitations of the research design, this study suggests that people are more likely to stop using racist language if they are chastised by the bot with the white person pictured in its Twitter profile than by the bot with the African-American person pictured in its profile.

In a recent study, my coauthors and I created bots designed to disrupt social media echo chambers in order to examine how such echo chambers shape the formation of political views. We surveyed a large sample of Democrats and Republicans who visit Twitter at least three times each week about a range of social policy issues. One week later, we randomly assigned respondents to a treatment condition in which they were offered financial incentives to follow a Twitter bot for one month that exposed them to messages produced by elected officials, organizations, and other opinion leaders with opposing political ideologies. Respondents were re-surveyed at the end of the month to measure the effect of this treatment, and at regular intervals throughout the study period to monitor treatment compliance. We find that Republicans who followed a liberal Twitter bot became substantially more conservative post-treatment, and Democrats who followed a conservative Twitter bot became slightly more liberal post-treatment.

This study was somewhat more involved than the Munger study (and also much more expensive), since it not only involved observation of publicly available Twitter data but also linked these data to surveys of a large group of people. The bots were also more sophisticated, in that they were created using a variety of tools from social network analysis and natural language processing. Still, these bots were not terribly difficult to create, and I describe the steps needed to create a bot such as ours in the following section.

How to Create a Bot

One way to create a bot is to write an .R script that is hosted on a single computer and runs throughout the study period. Yet such a strategy presents numerous obstacles. First, it ties up the R session on that machine, which therefore cannot be used for other routine work; users must either have another computer or a lot of time on their hands. Second, bots that are hosted on a single laptop or desktop can typically only be accessed or controlled from that machine. Third, all machines are prone to failure, and if the machine that hosts the bot fails, valuable time can go by during a field experiment before the researcher becomes aware of the failure.

For these reasons, I host the bots I’ve built in my work on a cloud machine running RStudio via an Amazon EC2 server. This may sound complicated, but it is actually rather straightforward.

The first step is to create an “Amazon Web Services” account. If you are a student, you may be eligible for 750 hours of free computing time. If you are not a student, you may be pleased to see that cloud computing time can be purchased quite inexpensively, particularly if you do not require significant computing power.

The second step is to find a “machine image” that will provide Amazon with instructions about how to create a cloud machine that can run RStudio. There are now many of these available, but one of the more popular ones is Louis Aslett’s. Click on the “region” closest to you to minimize latency (the time it takes for instructions to travel between your location and the Amazon server farm). You will also need to follow additional instructions on the aforementioned website in order to configure a “security group” and open up incoming HTTP traffic via port 80.

Finally, cut and paste the “Public IP or DNS” address from your Amazon EC2 page into your web browser, and you will be redirected to an RStudio log-in page. By default, your user name and password are both set to “rstudio,” but you should change these immediately after logging in; this is good practice that adds security and prevents others from using your cloud machine. Keep in mind that as long as your machine is running you will be charged by the hour, so make sure to shut down your cloud machines once you are done using them.

Regardless of where you host your bot, you will need to write some code to make it perform the functions necessary for your study. Below, I present code for a very primitive example of a bot that posts a recent message about computational social science each hour for 24 hours.

Note that this code assumes that you have already authenticated with Twitter in some manner (a sketch of one way to do so appears below). Be certain that your bot falls within Twitter’s terms of service, and make sure to avoid rate limiting (to learn how to identify your rate limits, see my previous tutorial on Application Programming Interfaces).
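The examples below use the rtweet package, which provides the search_tweets() and post_tweet() functions. Here is a minimal sketch of one way to authenticate with rtweet; the app name and all key values are placeholders you would replace with credentials from your own Twitter developer account.

library(rtweet)

#Create an authentication token using credentials from your Twitter app
#(all values below are hypothetical placeholders)
token <- create_token(
  app             = "my_bot_app",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)

#Check how many search requests remain before hitting a rate limit
rate_limit(query = "search/tweets")

With authentication in place, the bot itself can be as simple as the following loop.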

for (i in 1:24){
  #Search for 50 recent tweets about computational social science
  css_tweets <- search_tweets("Computational Social Science", n = 50, include_rts = FALSE)
  #Randomly pick one of them; the tweet text appears in the `text` column of the `css_tweets` data frame
  lucky_tweet <- sample(css_tweets$text, 1)
  #Post the selected text from the bot's account
  post_tweet(lucky_tweet)
  #Wait one hour (3600 seconds) before posting again
  Sys.sleep(3600)
}

Needless to say, bots can be much more sophisticated. Writing code to make bots interact with people, or sample candidates for interaction in real time, requires more code, though with a few if/else statements and a bit of elbow grease this can often be accomplished in surprisingly few lines, as the sketch below illustrates.
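For instance, here is a minimal sketch of a bot that replies to mentions, again assuming rtweet and prior authentication. The handle “@my_bot_account” and the question-mark decision rule are hypothetical placeholders, not the design of any of the studies described above.

library(rtweet)

#Find recent tweets that mention the bot (the handle is a placeholder)
mentions <- search_tweets("@my_bot_account", n = 10, include_rts = FALSE)

for (i in seq_len(nrow(mentions))){
  #A hypothetical decision rule: reply only when the mention asks a question
  if (grepl("\\?", mentions$text[i])){
    reply <- paste0("@", mentions$screen_name[i], " Thanks for your question!")
    post_tweet(reply, in_reply_to_status_id = mentions$status_id[i])
  } else {
    #Otherwise, skip this mention without replying
    next
  }
  Sys.sleep(60)  #pause between replies to stay well within rate limits
}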

Most studies will also require collecting data about the accounts with which the bot interacts. Such data can be collected within the bot code itself, or in a separate script that monitors accounts that are selected for interaction by the bot. Once again, make sure your bot does not violate Twitter’s terms of service or abuse rate limits.
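Here is a minimal sketch of the separate-script approach, once more assuming rtweet and prior authentication; treated_accounts is a hypothetical vector of the screen names your bot has interacted with.

library(rtweet)

#Hypothetical list of accounts the bot has interacted with
treated_accounts <- c("example_user_1", "example_user_2")

#Collect each account's 100 most recent tweets for later analysis
monitored <- lapply(treated_accounts, function(account){
  get_timeline(account, n = 100)
})
names(monitored) <- treated_accounts

#Save the results with a timestamp so the script can be re-run periodically
saveRDS(monitored, paste0("monitored_", Sys.Date(), ".rds"))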

Ethical Issues in App and Bot-Based Research

Ethical issues in app- and bot-based research are manifold, and IRB guidelines for such research are in their infancy. For these reasons, it is vital, in my view, that computational social scientists hold themselves to a standard that is higher than the IRB’s, because there will no doubt be ethical issues on the horizon that we do not yet know exist. The Cambridge Analytica scandal may be only the tip of the iceberg in terms of the potential for data collected by apps to be repurposed or merged with other datasets for other purposes. Such issues, combined with the perennial issue of data security and the challenge of maintaining confidentiality with increasingly detailed data (or metadata), should inspire researchers to carefully review their plans not only with the IRB, but also with other members of the computational social science community who can help them ensure that their research is ethical and safe.