Chris Bail
Duke University
www.chrisbail.net
This is part of a series of tutorials I’ve written on collecting digital trace data from sites such as Twitter, Facebook, and other internet sources. Earlier tutorials demonstrated some of the potential of digital trace data, but they also highlighted many limitations of these new wellsprings of data. For example, digital trace data are often incomplete, inaccessible, non-representative, unstructured (and thus difficult to work with), and sensitive in nature. Because of these limitations, there is growing consensus that hybrid approaches are needed that combine digital trace data with more conventional methods such as surveys.
Note: this tutorial is a work in progress. It will be updated soon to include more annotated code.
When I created the Find Your People app, app-building required fairly involved knowledge of programming in multiple languages, web design, and cloud computing. The R package Shiny has since become a game changer. Shiny is an interactive app-building tool that you can use directly from RStudio. In addition to this easy-to-use, integrated app-building tool, RStudio provides a variety of tools to host and deploy apps on the web with the click of a button. Finally, there is a vibrant community of Shiny app developers, many of whom share the code they used to create their apps on sites such as this one.
There are a number of excellent tutorials online about how to use Shiny, including this video series. Many Shiny apps are simple tools for interactive data visualization, yet Shiny enables development of apps for just about anything. Indeed, API calls can be embedded within Shiny apps to produce analyses of a user’s Twitter data; consider, for example, this nice example. Shiny also lets you create text boxes, multiple-choice buttons, and much of the other standard fare of online surveys. Together, these tools could be used to recreate, with far less time and energy, the functionality I developed in the stone age. A minimal sketch of what such an app might look like appears below.
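To make this concrete, here is a minimal sketch of a survey-style Shiny app with a text box, a multiple-choice question, and a submit button. The question wording, the input names, and the placeholder saveData() function are all hypothetical; a real study would substitute its own instrument and storage back end.
library(shiny)

#Define a simple survey-style user interface
ui <- fluidPage(
  textInput("handle", "What is your Twitter handle?"),
  radioButtons("frequency", "How often do you use Twitter?",
               choices = c("Daily", "Weekly", "Rarely")),
  actionButton("submit", "Submit")
)

#Record responses when the user clicks "Submit"
server <- function(input, output, session) {
  observeEvent(input$submit, {
    #saveData() is a placeholder; replace with your own storage (e.g., write to a database or file)
    saveData(data.frame(handle = input$handle, frequency = input$frequency))
  })
}

shinyApp(ui = ui, server = server)
Running this script in RStudio launches the app locally; deploying it to the web is then a matter of clicking the "Publish" button or using RStudio's hosting tools mentioned above.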
One way to create a bot is to write an .R script that is hosted on a single computer and runs throughout the study period. Yet such a strategy presents numerous obstacles. First, it ties up the R session on that machine, which therefore cannot be used for other routine work; users either need another computer or a lot of time on their hands. Second, bots hosted on a single laptop or desktop can typically only be accessed or controlled from that machine. Third, all machines are prone to failure, and if the machine that hosts the bot fails during a field experiment, valuable time can go by before the researcher becomes aware of the failure.
For these reasons, I host the bots I’ve built in my work on a cloud machine running RStudio via an Amazon EC2 server. This may sound complicated, but it is actually rather straightforward. The first step is to create an “Amazon Web Services” account. If you are a student, you may be eligible for 750 hours of free computing time. If you are not a student, you may be pleased to see that cloud computing time can be purchased quite inexpensively, particularly if you do not require significant computing power. The second step is to find a “machine image” that provides Amazon with instructions about how to create a cloud machine that can run RStudio. There are now many of these available, but one of the more popular ones is Louis Aslett’s. Click on the “region” closest to you to minimize latency (the time it takes for instructions to travel between your location and the Amazon server farm). You will also need to follow additional instructions on the aforementioned website to configure a “security group” and open up incoming HTTP traffic via port 80. You can then cut and paste the “Public IP or DNS” address from your Amazon EC2 page into your web browser, and you will be redirected to an RStudio log-in page. By default, your user name and password are set to “rstudio,” but you should change these immediately after logging in (good practice to add security and prevent others from using your cloud machine). Keep in mind that as long as your machine is running, you will be charged by the hour, so make sure to shut down your cloud machines once you are done using them.
Regardless of where you host your bot, you will need to write some code to make it perform the functions necessary for your study. Below, I present code for a very primitive bot that searches for recent tweets about computational social science and re-posts one of them each hour for 24 hours.
Note that this code assumes you have already authenticated with Twitter. Be certain that your bot falls within Twitter’s terms of service, and make sure to avoid rate limiting (to learn how to identify your rate limits, see my previous tutorial on Application Programming Interfaces).
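If you have not yet authenticated, one way to do so with the rtweet package is sketched below. The app name and the key/token values are placeholders; you would substitute the credentials from your own Twitter developer account.
library(rtweet)

#Authenticate with placeholder credentials from your own Twitter developer account
twitter_token <- create_token(
  app = "my_bot_app",                        #hypothetical app name
  consumer_key = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token = "YOUR_ACCESS_TOKEN",
  access_secret = "YOUR_ACCESS_SECRET"
)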
library(rtweet)

for (i in 1:24){
  #Search for 50 recent tweets about computational social science, excluding retweets
  css_tweets <- search_tweets("Computational Social Science", n = 50, include_rts = FALSE)
  #Randomly pick one of them; the tweet text appears in the `text` variable within the `css_tweets` data frame
  lucky_tweet <- sample(css_tweets$text, 1)
  #Post the selected text from the bot's account
  post_tweet(lucky_tweet)
  #Wait one hour (3600 seconds) before posting again
  Sys.sleep(3600)
}
Needless to say, bots can be much more sophisticated. Writing code that makes a bot interact with people, or sample candidates for interaction in real time, requires more code, though with a few “if/else” statements and a bit of elbow grease this can often be accomplished in surprisingly few lines. A hedged sketch of one such interactive bot appears below.
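For instance, the sketch below searches for recent mentions of the bot and replies only to those that ask a question. The handle @my_css_bot and the reply text are hypothetical, and the column names follow the conventions of older versions of the rtweet package.
library(rtweet)

#Search for recent tweets that mention the bot's (hypothetical) handle
mentions <- search_tweets("@my_css_bot", n = 20, include_rts = FALSE)

for (i in seq_len(nrow(mentions))) {
  #Reply only to mentions that ask a question; otherwise skip them
  if (grepl("\\?", mentions$text[i])) {
    post_tweet("Thanks for your question! (hypothetical reply text)",
               in_reply_to_status_id = mentions$status_id[i])
  } else {
    next
  }
}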
Most studies will also require collecting data about the accounts the bot interacts with. Such data can be collected within the bot code itself, or in a separate script that monitors the accounts selected for interaction, as sketched below. Once again, make sure your bot does not conflict with Twitter’s terms of service and does not abuse rate limits.
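Here is a minimal sketch of such a monitoring script, assuming you keep a vector of the screen names your bot has interacted with; the handles and the output file name are placeholders.
library(rtweet)

#Hypothetical vector of accounts the bot has interacted with
study_accounts <- c("participant_one", "participant_two")

#Collect profile information and recent tweets for each account
profiles <- lookup_users(study_accounts)
recent_tweets <- get_timeline(study_accounts, n = 100)

#Save both objects with a date stamp so repeated runs can track change over time
saveRDS(list(profiles = profiles, tweets = recent_tweets),
        file = paste0("bot_study_data_", Sys.Date(), ".rds"))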
Ethical issues in app- and bot-based research are manifold, and IRB guidelines for such research are in their infancy. For these reasons, it is vital, in my view, that computational social scientists hold themselves to a standard higher than what the IRB requires, because there will no doubt be ethical issues on the horizon that we cannot yet anticipate. The Cambridge Analytica scandal may be only the tip of the iceberg in terms of the potential for data collected by apps to be repurposed or merged with other datasets for other purposes. Such issues, combined with the perennial problem of data security and the challenge of maintaining confidentiality with increasingly detailed data (or metadata), should inspire researchers to carefully review their plans not only with the IRB, but also with other members of the computational social science community who can help them ensure that their research is ethical and safe.