Text as Data Course
Chris Bail, PhD
Duke University
www.chrisbail.net
github.com/cbail
twitter.com/chris_bail
Application Programming Interfaces, or APIs, have become one of the most important ways to access and transfer data online, and increasingly APIs can even analyze your data as well. Compared to screen-scraping data, which is often illegal, logistically difficult, or both, APIs are a useful tool for making custom requests for data in a manner that is well structured and considerably easier to work with than the HTML or XML data described in my previous tutorials on screen-scraping. This tutorial assumes basic knowledge of R and the other skills described in previous tutorials at the link above.
APIs are tools for building apps or other forms of software that help people access certain parts of large databases. Software developers can combine these tools in various ways (or combine them with tools from other APIs) in order to generate even more useful tools. Most of us use such apps each day. For example, if you install the Spotify app within your Facebook page to share music with your friends, this app is extracting data from Spotify’s API and then posting it to your Facebook page by communicating with Facebook’s API. There are countless examples of this on the internet at present, thanks in large part to the advent of Web 2.0, or the historical moment when websites became much more intertwined and interdependent.
The number of APIs that are publicly available has expanded dramatically over the past decade, as the figure below shows. At the time of this writing, the website Programmable Web lists more than 19,638 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others. Though the core function of most APIs is to provide software developers with access to data, many APIs now analyze data as well. This might include facial recognition APIs, voice to text APIs, APIs that produce data visualizations, and so on.
In order to illustrate how an API works, it will be useful to start with a very simple one. Suppose we want to use the Google Maps API to geo-code a named entity, that is, to tag the name of a place with latitude and longitude coordinates. To do this, we write a URL that a) names the API; and b) includes the text of the query we want to make. If we Googled “Google Maps API Geocode” we would eventually be pointed towards the documentation for that API and learn that the base URL for the Google Maps API is https://maps.googleapis.com. We want to use the geocoding function of this API, so we need a URL that points to this more specific part of the API: https://maps.googleapis.com/maps/api/geocode/json?address=. We can then add a named entity such as “Duke” to the end of the URL, like this: https://maps.googleapis.com/maps/api/geocode/json?address=Duke. This link (with some additional text that I will describe below) produces this output in a web browser:
What we are seeing is something called JSON data. Though it may look somewhat messy at first glance (lots of brackets, colons, commas, and indentation patterns), it is in fact very highly structured and capable of storing complex types of data. Here, we can see that our request to geocode “Duke” not only identified the city within which it is located (Durham), but also the county, country, and, towards the end of the page, the latitude and longitude data we were looking for. We will learn how to extract that piece of information later. The goal of the current discussion is to give you an idea of what an API is and how it works.
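As a small preview of that extraction step, here is a minimal sketch in R. It assumes the jsonlite package is installed, and it parses a heavily abbreviated, hand-written stand-in for the response rather than live API output. The field names follow the general shape of a geocoding response, but the coordinate values are illustrative placeholders, not real results.

```r
library(jsonlite)

# Hand-written, abbreviated stand-in for a geocoding response.
# The numbers are placeholders for illustration only.
response_text <- '{
  "results": [
    {
      "formatted_address": "Durham, NC, USA",
      "geometry": {
        "location": {"lat": 36.0, "lng": -78.9}
      }
    }
  ],
  "status": "OK"
}'

# simplifyVector = FALSE keeps the nested list structure of the JSON
parsed <- fromJSON(response_text, simplifyVector = FALSE)

# Walk down the nested structure to reach the coordinates
location <- parsed$results[[1]]$geometry$location
location$lat
location$lng
```

The point is simply that each level of brackets in the JSON becomes one level of a nested list in R, so a value can be reached by naming each level in turn.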
If we wanted to search for another geographic location, we could take the link above and replace “Duke” with the name of another place. Try it out to give yourself a very rudimentary sense of how an API works.
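The same substitution can be done in code. The sketch below uses only base R; build_geocode_url is a hypothetical helper name, and the optional api_key argument reflects the assumption that the API expects a key in a key= query parameter (check the current documentation before relying on this).

```r
# Hypothetical helper: build a geocoding request URL for a place name.
build_geocode_url <- function(place, api_key = NULL) {
  base_url <- "https://maps.googleapis.com/maps/api/geocode/json"
  # reserved = TRUE also escapes characters such as commas and ampersands
  query <- paste0("address=", URLencode(place, reserved = TRUE))
  if (!is.null(api_key)) {
    # Assumption: the key is passed as a `key` query parameter
    query <- paste0(query, "&key=", URLencode(api_key, reserved = TRUE))
  }
  paste0(base_url, "?", query)
}

build_geocode_url("Duke")
build_geocode_url("Duke University, Durham NC")
```

Encoding the place name matters as soon as it contains spaces or punctuation, since those characters cannot appear unescaped in a URL.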
Though anyone can make a request to the Google Maps API, getting data from Facebook’s API (which Facebook calls the “Graph” API) is considerably more difficult. This is because—with good reason—Facebook does not want a software developer to collect data about people whom they do not have a connection with on Facebook. In order to prevent this, Facebook’s Graph API—and many other APIs—require you to obtain “credentials” or codes/passwords that identify you and determine which types of data you are allowed to access. To illustrate this further, let’s take a look at a tool Facebook built to help people learn about APIs. It’s called the Graph API explorer.
If you have a Facebook account, and if you are logged in, Facebook will generate credentials for you automatically in the form of something called an “Access Token.” In the screenshot above, this appears in a bar towards the bottom of the screen. This code will give you temporary authorization to make requests from Facebook’s Graph API, but ONLY for data that you are allowed to access from your own Facebook page. If you click the blue “submit” button at the top right of the screen, you will see some output that contains your name and an ID that Facebook assigns to you. With some more effort, we could use this tool to make API calls to access our friend list, our likes, and so on, but for now, I’m simply trying to make the point that each person gets their own code that allows them to access some, but certainly not all, of the data on Facebook’s API. If I were to write in “cocacola” instead of “me” to get access to data posted by this business, I would get an error message suggesting that my current credentials do not give me access to that data.
Credentials may not only determine your access to people with whom you are connected on a social network, but also other privileges you may have vis-a-vis an API. For example, many APIs charge money for access to their data or services, and thus you will only receive your credentials after setting up an account. As we will see below, some sites also require you to have multiple types of credentials, which can be described using a variety of terms such as “tokens,” “keys,” or “secrets.”
Before we make any more calls to APIs, we need to become familiar with an important concept called “Rate Limiting.” The credentials in the previous section not only define what type of information we are allowed to access, but also how often we are allowed to make requests for such data. These are known as “rate limits.” If we make too many requests for data within too short a period of time, an API will temporarily block us from collecting data for a period of time that can range from 15 minutes to 24 hours or more, depending upon the API. Rate limiting is necessary so that APIs are not overwhelmed by too many requests that occur at the same time, which would slow down access to data for everyone. Rate limiting also enables large companies such as Google, Facebook, or Twitter to prevent developers from collecting large amounts of data that could either compromise their users’ confidentiality or threaten their business model (since data has such immense value in today’s economy).
The exact timing of rate limiting is not always public, since knowing such time increments could enable developers to “game” the system and make rapid requests as soon as rate limiting has ended. Some APIs, however, allow you to make an API call or query in order to learn how many more requests you can make within a given time period before you are rate limited.
To illustrate the process of obtaining credentials and better understanding rate limiting, I will now present a worked example of how to obtain different types of data from the Twitter API. The first step in this process is to obtain credentials from Twitter that will allow you to make API calls.
Twitter, like many other websites, requires you to create an account in order to receive credentials. To do this, we need to visit https://apps.twitter.com. There, you will have to create a developer account by clicking “Apply for a developer account.” You may be asked to confirm your email address or add a mobile phone number because two-factor authentication helps Twitter prevent people from obtaining a large number of different credentials using multiple accounts that could be used to collect large amounts of data without being rate limited, or for other nefarious purposes such as creating armies of bots that produce spam or attempt to influence elections.
Next, you will be asked whether you want to request a developer account for an organization or for personal use. You will most likely want to choose personal use (the organizational account can allow one person to request credentials that can be shared across a large group of people which could be useful if you work within a lab or other business).
Next, you will be asked a series of questions about how you want to use Twitter’s API. Unfortunately, as far as I know, Twitter does not publish exact guidelines about who is allowed to use the API and why. That said, you can learn a lot by reading Twitter’s terms of service. Obvious red flags would include people who are hoping to build tools that harass Twitter users, hoard Twitter’s data (particularly for business purposes), or serve other negative purposes. You will see these terms after you describe why you want to apply for credentials.
Once you accept the terms, your app developer request will go under review by Twitter. At the time of this writing, this process is rather time intensive, and with good reason, since Twitter is most likely employing large numbers of people to vet everyone who is applying for credentials right now. I’ve seen people get credentials within a day or two, and others who have waited more than a week. Unfortunately, I even know of several cases where people made multiple applications without any of the red flags above (and using different wording) that were ultimately rejected. Hopefully, yours will be approved (and if not, you might try mentioning your problem in a tweet to @TwitterAPI on Twitter; this seems to have worked for several people that I know).
Once your developer account is approved, you can log in once again and click the “Create New App” button at the top right of the screen. Our goal is not to create a fully fledged app at this point, but simply to obtain the credentials necessary to begin making some simple calls to the Twitter API. You can name your app whatever you want, describe it however you want, and put in the name of any website you like. The one important thing you must do is put the following text in the “Callback URL” text box: http://127.0.0.1:1410
This address describes the location where the API will return your data. In this case, it is your web browser (but it could be another site where you want to store the results of the data).
If you followed the steps above, the name of your application should now appear. Click on it, and then click on the “Keys and Access Tokens” tab in order to get your credentials. Unfortunately, Twitter makes developers get two different types of credentials, which are listed on that page. These are blurred out in the screenshot below because I do not want people who read this web page to have access to my credentials, which they could then abuse in various ways:
The next step is to define your credentials as string variables in R, which we will then use to authenticate ourselves with the Twitter API. Make sure to select the entire string (by triple clicking), and make sure that you do not accidentally leave out the first or last character (or add spaces):
app_name = "YOURAPPNAMEHERE"
consumer_key = "YOURKEYHERE"
consumer_secret = "YOURSECRETHERE"
access_token = "YOURACCESSTOKENHERE"
access_token_secret = "YOURACCESSTOKENSECRETHERE"
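If you plan to share your script, one common alternative to pasting credentials into the code is to read them from environment variables. The sketch below is only an illustration: get_credential and the variable name TWITTER_CONSUMER_KEY are hypothetical names chosen for this example, not part of any package.

```r
# Hypothetical helper: read a credential from an environment variable
# and stop with a clear message if it has not been set.
get_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var_name, " is not set.", call. = FALSE)
  }
  # Remove stray whitespace that often sneaks in when copying and pasting
  trimws(value)
}

# For demonstration only: set a placeholder value in this session.
# In practice this would live in a private file such as ~/.Renviron.
Sys.setenv(TWITTER_CONSUMER_KEY = " YOURKEYHERE ")

consumer_key_from_env <- get_credential("TWITTER_CONSUMER_KEY")
consumer_key_from_env
```

The trimws() call also guards against the accidental leading or trailing spaces mentioned above.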
Next, we are going to install an R package from Github called rtweet that helps us make calls to Twitter’s API. More specifically, it provides a long list of functions that both a) construct API URL queries for different types of information; and b) parse the resulting data into neat formats. In order to authenticate, you may also need to install the httpuv package (if so, you will receive an error message about this package). If you have never installed a package from Github before, you will need the devtools package to do this.
library(devtools)
install_github("mkearney/rtweet")
Now, we are ready to authenticate ourselves vis-a-vis Twitter’s API. To do this, we are going to use rtweet’s create_token function, which makes an API call that passes the credentials we defined above, and then opens a web browser with an authentication dialogue that you must authorize by clicking the blue “authorize” button. You should then receive the following message: Authentication complete. Please close this page and return to R.
library(rtweet)
twitter_token = create_token(
  app = app_name,
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  # pass the access token secret defined above
  access_secret = access_token_secret,
  set_renv = TRUE
)
Now, we can take full advantage of all of the many useful functions within the rtweet package for collecting data from Twitter. Let’s begin by extracting 50 tweets that use the hashtag #korea.
korea_tweets<-search_tweets("#Korea", n=50, include_rts = FALSE)
This code creates a dataframe called korea_tweets which we may then browse. Let’s take a look at the text of the first six tweets, which rtweet stores in a variable called text.
head(korea_tweets$text)
## [1] "<U+0436><U+0434><U+0430><U+043B><U+0430> <U+043A><U+0430><U+043B><U+0435><U+043D><U+0434><U+0430><U+0440><U+044C> <U+0431><U+043E><U+043B><U+044C><U+0448><U+0435> <U+0447><U+0435><U+043C> <U+0441><U+0430><U+043C> <U+0430><U+043B><U+044C><U+0431><U+043E><U+043C> <U+0001F602><U+0001F60D><U+2728>\n.\nhttps://t.co/QrXNsngQHG\n.\n.\n#straykids #<U+C2A4><U+D2B8><U+B808><U+C774><U+D0A4><U+C988> #BangChan \n#Leeknow #Changbin #Hyunjin #Han\n#Felix #Seungmin #I_N #kpop #korea #straykidsalbum #skzalbum #skzcollections #straykidscollection #clé_levanter https://t.co/HKNOUUvKkj"
## [2] "Seoul Korea #seoul #thisiseoul #weloveseoul #seoulonly #seoulfans #seoulgram #instaseoul #korea #southkorea #artofdestinations #passionpassport #destinationstotravel #photooftheday #beautifuldestinations #earthpics #amazingearth #seetheworld #lonelyplan… https://t.co/i7unomZln1 https://t.co/syMJszDbBy"
## [3] "Al fin puedo enseñaros mi primer #Chibi en #3D \nEspero que os guste, y que pronto pueda enseñaros alguno más...\n#Blender3d #Overwatch #blizzard #dva #mecha #pink #kawaii #ArtistOnTwitter #GG #anime #rabbit #gamer #korea #overwatch2 #fanart https://t.co/YyUO9NiuVR"
## [4] "36. Re-Founding holy law in #Korea:\nhttps://t.co/FQjWbJtval"
## [5] "Check out this job posted on https://t.co/dALvxcyETq(Teaching JOBS Korea starting ASAP to August South Korea) #korea #hiring #caedchat #ukedchat \n#TEFL #ESL #ELT #EFL #CELTA \n https://t.co/Hgle4gGmej"
## [6] "Excellent Top Quality Brand New Professional Titanium hair dressing scissors. The product is made to the highest specification and is quality tested ...\n\n#surgical #surgicalinstruments #surgicalbeauty #beautyproducts #barberscissor #scissor #America #london #korea #germany #japan https://t.co/RCZ7eoAD8P"
Note that this API call also generated a lot of other interesting variables, including the name and screen name of the user, the time of their post, and a variety of other metrics including links to media content and user profiles. A small number of users also enable geolocation of their tweets– and if that information is available it will appear in this dataset. Here is the full list of variables we collected via our API call above:
names(korea_tweets)
## [1] "user_id" "status_id"
## [3] "created_at" "screen_name"
## [5] "text" "source"
## [7] "display_text_width" "reply_to_status_id"
## [9] "reply_to_user_id" "reply_to_screen_name"
## [11] "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count"
## [15] "quote_count" "reply_count"
## [17] "hashtags" "symbols"
## [19] "urls_url" "urls_t.co"
## [21] "urls_expanded_url" "media_url"
## [23] "media_t.co" "media_expanded_url"
## [25] "media_type" "ext_media_url"
## [27] "ext_media_t.co" "ext_media_expanded_url"
## [29] "ext_media_type" "mentions_user_id"
## [31] "mentions_screen_name" "lang"
## [33] "quoted_status_id" "quoted_text"
## [35] "quoted_created_at" "quoted_source"
## [37] "quoted_favorite_count" "quoted_retweet_count"
## [39] "quoted_user_id" "quoted_screen_name"
## [41] "quoted_name" "quoted_followers_count"
## [43] "quoted_friends_count" "quoted_statuses_count"
## [45] "quoted_location" "quoted_description"
## [47] "quoted_verified" "retweet_status_id"
## [49] "retweet_text" "retweet_created_at"
## [51] "retweet_source" "retweet_favorite_count"
## [53] "retweet_retweet_count" "retweet_user_id"
## [55] "retweet_screen_name" "retweet_name"
## [57] "retweet_followers_count" "retweet_friends_count"
## [59] "retweet_statuses_count" "retweet_location"
## [61] "retweet_description" "retweet_verified"
## [63] "place_url" "place_name"
## [65] "place_full_name" "place_type"
## [67] "country" "country_code"
## [69] "geo_coords" "coords_coords"
## [71] "bbox_coords" "status_url"
## [73] "name" "location"
## [75] "description" "url"
## [77] "protected" "followers_count"
## [79] "friends_count" "listed_count"
## [81] "statuses_count" "favourites_count"
## [83] "account_created_at" "verified"
## [85] "profile_url" "profile_expanded_url"
## [87] "account_lang" "profile_banner_url"
## [89] "profile_background_url" "profile_image_url"
As a brief aside, the rtweet package also interfaces nicely with ggplot and other visualization libraries to produce nice plots of the results above. For instance, let’s make a plot of the frequency of tweets about Korea over the past few days:
library(ggplot2)
ts_plot(korea_tweets, "3 hours") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Tweets about Korea from the Past Day",
    subtitle = "Twitter status (tweet) counts aggregated using three-hour intervals",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
The search_tweets function also has a number of useful options or arguments. For instance, we can restrict the results to English-language tweets from the United States using the code below. The code also restricts the results to non-retweets and focuses upon the most recent tweets, rather than a mixture of popular and recent tweets, which is the default setting.
# combine the keyword and the language filter in a single query string
nk_tweets <- search_tweets("korea lang:en",
  geocode = lookup_coords("usa"),
  n = 1000, type = "recent", include_rts = FALSE
)
rtweet can also extract latitude and longitude coordinates for tweets by users who allow Twitter to track their location:
geocoded <- lat_lng(nk_tweets)
We can then plot these results as follows (you may need to install the maps package to do this):
library(maps)
par(mar = c(0, 0, 0, 0))
maps::map("state", lwd = .25)
with(geocoded, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
We don’t see all 1,000 tweets in this diagram for an important reason: we are only looking at people who allow Twitter to track their location (and this is roughly 1 in 100 people at the time of this writing).
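To check how many tweets in a sample actually carry coordinates, one could count the rows where both lat and lng are present. The sketch below uses a small made-up data frame as a stand-in for the output of lat_lng(), so the values are placeholders rather than real tweets.

```r
# Toy stand-in for the output of lat_lng(): three tweets, one with coordinates
toy_geocoded <- data.frame(
  text = c("tweet a", "tweet b", "tweet c"),
  lat = c(NA, 35.99, NA),
  lng = c(NA, -78.90, NA)
)

# A tweet is mappable only if both coordinates are present
has_coords <- !is.na(toy_geocoded$lat) & !is.na(toy_geocoded$lng)

sum(has_coords)   # number of tweets that can be plotted
mean(has_coords)  # share of tweets that can be plotted
```

Running the same two lines on the geocoded object would show how small the mappable subset of a real sample is.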
Twitter’s API is also very useful for collecting data about a given user. Let’s take a look at Bernie Sanders’s Twitter page: http://www.twitter.com/SenSanders. There we can see Senator Sanders’s description and profile, the full text of his tweets, and, if we click several links, the names of the people he follows, those who follow him, and the tweets which he has “liked.”
First, let’s get his 5 most recent tweets:
sanders_tweets <- get_timelines(c("sensanders"), n = 5)
head(sanders_tweets$text)
## [1] "Mnuchin's economics experience? Being called the \"Foreclosure King\" for heading a bank that kicked 50,000 families out of their homes.\n\nKeep going, Greta. Scientists tell us we must transform our economy to save the planet. Economists say we'll create millions of jobs doing so. https://t.co/dkWOJadekP"
## [2] "Climate change is more likely to impact poor people and people of color first and worst. Yet the world's poorest people are least responsible for the changing climate.\n\nWe must transition our economy away from fossil fuels to fight growing inequality here and around the world. https://t.co/CovcckkJvl"
## [3] "Nearly half a century ago, the Supreme Court affirmed abortion as a constitutional right.\n\nI am sick and tired of conservatives who say, \"Get government off our backs,\" while telling women what they can do with their bodies.\n\nWomen get to control their bodies—not politicians. https://t.co/SjN5WD3Fbv"
## [4] "I congratulate Spain's government for declaring a climate emergency. The United States must now follow suit. \n\nMy resolution with @AOC and @RepBlumenauer mobilizes America to defeat the existential threat facing humanity. https://t.co/Eu8gBa4c2d"
## [5] "Strong unions are key to solving the crises facing higher education. I am delighted to join @repmarkpocan in this effort to protect graduate student workers who are organizing for their rights. https://t.co/jfyTDMfGc9"
Note that you are limited to requesting the last 3,200 tweets, so obtaining a complete database of tweets for a person who tweets very often may not be feasible, or you may need to purchase the data from Twitter itself.
Next, let’s get some broader information about Sanders using the lookup_users function:
sanders_twitter_profile <- lookup_users("sensanders")
This creates a dataframe with a variety of additional variables. For example:
sanders_twitter_profile$description
## [1] "U.S. Senator Bernie Sanders of Vermont is the longest-serving independent in congressional history."
sanders_twitter_profile$location
## [1] "Vermont/DC"
sanders_twitter_profile$followers_count
## [1] 9009344
We can also use the get_favorites() function to identify the Tweets Sanders has recently “liked.”
sanders_favorites<-get_favorites("sensanders", n=5)
sanders_favorites$text
## [1] "@SenSanders Go Bernie <U+0001F618><U+0001F64F><U+2764><U+FE0F>"
## [2] "@SenSchumer Thank you @SenSanders for leading the way in opposing #USMCA for failing to combat or address the climate crisis — and thank you to @SenGillibrand @SenKamalaHarris @brianschatz @SenWhitehouse @SenJackReed @SenMarkey @SenSchumer for following suit!\n\nhttps://t.co/yHLkFrq0lF"
## [3] "The average student debt for black college graduates 3 years after graduation exceed $50K. Yes Senator Sanders cancel all the debt and make all public colleges and HBCUs tuition free https://t.co/gIQpuUfBpN https://t.co/n61AoTyhDv"
## [4] "Wall Street financing of corporate polluters since 2016:\n\nJPMorgan Chase: $196 billion \nWells Fargo: $151 billion\nBank of America: $106 billion\n\nWall Street is funding the destruction of our planet. https://t.co/MrjFuxPXRM"
## [5] "Yesterday my colleagues and I introduced the No War Against Iran Act, which would deny funds for unauthorized military action against Iran. \n\nCongress must act quickly so Trump doesn’t unilaterally take our country into another war."
We can also get a list of the people who follow Sanders like this:
sanders_follows<-get_followers("sensanders")
This produces the user IDs of those followers, and we could get more information about them if we want using the lookup_users function. If we were interested in creating a larger social network analysis dataset centered around Sanders, we could collect the followers of his followers within a loop.
Looping is an efficient way of collecting a large amount of data, but it will also trigger rate limiting. As I mentioned above, however, Twitter enables users to check their rate limits. The rate_limit() function in the rtweet package does this as follows:
rate_limits<-rate_limit()
head(rate_limits[,1:4])
## # A tibble: 6 x 4
## query limit remaining reset
## <chr> <int> <int> <drtn>
## 1 lists/list 15 15 15.00576 mins
## 2 lists/memberships 75 75 15.00576 mins
## 3 lists/subscribers/show 15 15 15.00576 mins
## 4 lists/members 900 900 15.00576 mins
## 5 lists/subscriptions 15 15 15.00576 mins
## 6 lists/show 75 75 15.00576 mins
In the code above I created a dataframe that describes the total number of calls I can make within a given deadline (called reset). In this case, it is 15 minutes. In order to prevent rate limiting within a large loop, it is common practice to employ R’s Sys.sleep function, which tells R to sleep for a certain number of seconds before proceeding to the next iteration of a loop.
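One way to combine these two ideas is to pause only when a limit has actually been used up. The following is a minimal sketch under stated assumptions: seconds_to_wait is a hypothetical helper, and the data frame is a toy stand-in that copies the column layout shown above, with one row edited to show an exhausted limit.

```r
# Toy stand-in for the data frame returned by rate_limit();
# the second row is edited to show an exhausted limit.
toy_limits <- data.frame(
  query = c("lists/list", "lists/members"),
  limit = c(15L, 900L),
  remaining = c(15L, 0L),
  stringsAsFactors = FALSE
)
toy_limits$reset <- as.difftime(c(15, 15), units = "mins")

# Hypothetical helper: how many seconds to pause before calling an endpoint
seconds_to_wait <- function(limits, endpoint, buffer_secs = 5) {
  row <- limits[limits$query == endpoint, , drop = FALSE]
  if (nrow(row) == 0) {
    stop("Unknown endpoint: ", endpoint, call. = FALSE)
  }
  # Requests remain, so there is no need to pause
  if (row$remaining[1] > 0) {
    return(0)
  }
  # Otherwise wait until the reset time, plus a small safety buffer
  as.numeric(row$reset[1], units = "secs") + buffer_secs
}

seconds_to_wait(toy_limits, "lists/list")
seconds_to_wait(toy_limits, "lists/members")
```

Inside a real loop, the result could be passed straight to Sys.sleep(), using the output of rate_limit() and whichever query name corresponds to the function being called.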
rtweet has a number of other useful functions which I will mention in case they might be useful to readers. get_trends() will identify the trending topics on Twitter in a particular area:
get_trends("New York")
## # A tibble: 50 x 9
## trend url promoted_content query tweet_volume place woeid
## <chr> <chr> <lgl> <chr> <int> <chr> <int>
## 1 Zion http~ NA Zion 390673 New ~ 2.46e6
## 2 Eli ~ http~ NA %22E~ 107723 New ~ 2.46e6
## 3 Jeter http~ NA Jeter 47230 New ~ 2.46e6
## 4 Schi~ http~ NA Schi~ 1251160 New ~ 2.46e6
## 5 #Tha~ http~ NA %23T~ 48698 New ~ 2.46e6
## 6 #Imp~ http~ NA %23I~ 136903 New ~ 2.46e6
## 7 #RHO~ http~ NA %23R~ NA New ~ 2.46e6
## 8 Jim ~ http~ NA %22J~ 12751 New ~ 2.46e6
## 9 Prin~ http~ NA %22P~ 62611 New ~ 2.46e6
## 10 #Nat~ http~ NA %23N~ 16591 New ~ 2.46e6
## # ... with 40 more rows, and 2 more variables: as_of <dttm>, created_at <dttm>
rtweet can even control your Twitter account. For example, you can post messages to your Twitter feed from R as follows:
post_tweet("I love APIs")
I have used this function in past work with bots. See for example, this paper.
Wrapping API calls within a Loop
Very often, one may wish to wrap API calls such as those we have made thus far into a loop to collect data about a long list of users. To illustrate this, let’s open a list of the Twitter handles of elected officials in the U.S. that I posted on my Github site:
#load list of twitter handles for elected officials
#(fileEncoding strips the byte-order mark that otherwise garbles the first column name)
elected_officials<-read.csv("https://cbail.github.io/Senators_Twitter_Data.csv", stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
head(elected_officials)
## bioguide_id party gender title birthdate firstname middlename lastname
## 1 C001095 R M Sen 5/13/77 Tom Cotton
## 2 G000562 R M Sen 8/22/74 Cory Gardner
## 3 M001169 D M Sen 8/3/73 Christopher S. Murphy
## 4 S001194 D M Sen 10/20/72 Brian Emanuel Schatz
## 5 Y000064 R M Sen 8/24/72 Todd C. Young
## 6 S001197 R M Sen 2/22/72 Benjamin Eric Sasse
## name_suffix state district senate_class
## 1 AR Junior Seat II
## 2 CO Junior Seat II
## 3 CT Junior Seat I
## 4 HI Senior Seat III
## 5 IN Junior Seat III
## 6 NE Junior Seat II
## website fec_id twitter_id
## 1 https://www.cotton.senate.gov H2AR04083 SenTomCotton
## 2 https://www.gardner.senate.gov H0CO04122 SenCoryGardner
## 3 https://www.murphy.senate.gov H6CT05124 senmurphyoffice
## 4 https://www.schatz.senate.gov S4HI00136 SenBrianSchatz
## 5 https://www.young.senate.gov H0IN09070 SenToddYoung
## 6 https://www.sasse.senate.gov/public S4NE00090 SenSasse
As you can see, the twitter_id column of this .csv file includes the Twitter “screen names” or handles we need to make API requests about each elected official. Let’s grab each official’s 100 most recent tweets, and combine them into a single large dataset of recent tweets by elected officials in the U.S.
#create empty container to store tweets for each elected official
elected_official_tweets<-as.data.frame(NULL)
for(i in 1:nrow(elected_officials)){
  #pull tweets
  tweets<-get_timeline(elected_officials$twitter_id[i], n=100)
  #populate dataframe
  elected_official_tweets<-rbind(elected_official_tweets, tweets)
  #pause for one second to further prevent rate limiting
  Sys.sleep(1)
  #print number/iteration for debugging/monitoring progress
  print(i)
}
This code would take some time to run, of course, since we are collecting 100 tweets from each official in the list. You may also get rate limited, depending upon your previous activity and your current rate limits. If so, increase the length of the pause in the Sys.sleep command above. You may also notice some error messages in your output. These could occur because Senators change their Twitter handles, or because they have an account but no tweets, or because of other such errors.
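If a failed request stops the loop entirely, one option is to wrap each call in tryCatch so that a problematic handle is recorded and skipped. The sketch below is hypothetical: collect_safely takes the fetching function as an argument, and fake_fetch stands in for get_timeline() so that the example runs without credentials.

```r
# Hypothetical helper: call fetch_fun for each handle, skipping failures
collect_safely <- function(handles, fetch_fun, pause_secs = 1) {
  results <- list()
  failed <- character(0)
  for (handle in handles) {
    tweets <- tryCatch(
      fetch_fun(handle),
      error = function(e) {
        message("Skipping ", handle, ": ", conditionMessage(e))
        NULL
      }
    )
    if (is.null(tweets)) {
      failed <- c(failed, handle)
    } else {
      results[[handle]] <- tweets
    }
    # pause between requests to reduce the risk of rate limiting
    Sys.sleep(pause_secs)
  }
  # bind once at the end instead of growing a data frame in the loop
  list(tweets = do.call(rbind, results), failed = failed)
}

# Stand-in for get_timeline() so the sketch runs without credentials
fake_fetch <- function(handle) {
  if (handle == "missing_account") stop("user not found")
  data.frame(screen_name = handle, text = paste("tweet from", handle))
}

out <- collect_safely(c("SenSasse", "missing_account", "SenToddYoung"),
                      fake_fetch, pause_secs = 0)
nrow(out$tweets)
out$failed
```

With real credentials, the second argument could be something like function(h) get_timeline(h, n = 100). Note that this only catches errors; warnings would still be printed as usual.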
In case you don’t care to pull the data yourself, I’ve saved a copy of the data I produced in September 2018 that you can load as follows:
load(url("https://cbail.github.io/Elected_Official_Tweets.Rdata"))
Now that we’ve collected the data, let’s create a simple model that predicts how much engagement tweets receive. To do this, we need to merge the dataset we read in from Github above with the new dataset we just produced (in order to look at attributes of each senator that might increase the reach of their tweets).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#rename twitter_id variable in original dataset in order to merge it with tweet dataset
colnames(elected_officials)[colnames(elected_officials)=="twitter_id"]<-"screen_name"
for_analysis<-left_join(elected_official_tweets, elected_officials)
## Joining, by = "screen_name"
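After a join like this it is worth checking whether every tweet found a match. Twitter handles are not case sensitive, but a join on character strings is, so a handle stored in lower case in one table will not match a mixed-case version in the other. Whether this affects the data here cannot be confirmed from the output above; the sketch below simply illustrates the check on made-up rows, using base R’s merge to stay dependency-free. The capitalization of the handles is hypothetical.

```r
# Made-up rows; the capitalization of the handles is hypothetical
officials <- data.frame(
  screen_name = c("senmurphyoffice", "SenSasse"),
  party = c("D", "R"),
  stringsAsFactors = FALSE
)
tweets <- data.frame(
  screen_name = c("SenMurphyOffice", "SenSasse", "SenSasse"),
  text = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# How many tweets would fail an exact, case-sensitive match?
unmatched_exact <- sum(!tweets$screen_name %in% officials$screen_name)

# Join on a lower-cased key instead
officials$join_key <- tolower(officials$screen_name)
tweets$join_key <- tolower(tweets$screen_name)
merged <- merge(tweets, officials[, c("join_key", "party")],
                by = "join_key", all.x = TRUE)

unmatched_exact
sum(is.na(merged$party))
```

Unmatched rows end up with missing values for the senator-level variables, and those rows are silently dropped when a model is fitted.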
Let’s inspect one measure of engagement, the number of retweets:
hist(for_analysis$retweet_count)
Next, because our data is so skewed, we could use something like a negative binomial regression model to examine the association between the various predictors in our data and the outcome. To do this we will use the glm.nb function from the MASS package. Note that the model below uses favorite_count (the number of likes) as its outcome:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
summary(glm.nb(favorite_count~
party+
followers_count+
statuses_count+
gender,
data=for_analysis))
## Warning: glm.fit: algorithm did not converge
## Warning in glm.nb(favorite_count ~ party + followers_count + statuses_count + :
## alternation limit reached
##
## Call:
## glm.nb(formula = favorite_count ~ party + followers_count + statuses_count +
## gender, data = for_analysis, init.theta = 0.2094007059, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8026 -1.0967 -0.7976 -0.4399 9.2466
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.844e+00 6.808e-02 100.528 < 2e-16 ***
## partyI -1.509e+00 1.888e-01 -7.995 1.29e-15 ***
## partyR -1.290e+00 5.080e-02 -25.391 < 2e-16 ***
## followers_count 1.568e-06 2.855e-08 54.912 < 2e-16 ***
## statuses_count -2.436e-05 5.789e-06 -4.208 2.58e-05 ***
## genderM -8.106e-02 6.270e-02 -1.293 0.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.2094) family taken to be 1)
##
## Null deviance: 13815 on 8898 degrees of freedom
## Residual deviance: 11408 on 8893 degrees of freedom
## (900 observations deleted due to missingness)
## AIC: 110818
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.20940
## Std. Err.: 0.00271
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -110804.18700
Unsurprisingly, senators who have more followers tend to produce tweets that receive more likes (note that the outcome in the model above is favorite_count). Those who have produced more tweets (which Twitter calls “statuses”) receive fewer likes. Finally, Republicans appear to get fewer likes than Democrats. This could result from the fact that Twitter users who follow politics, or at least Senators, are more likely to be Democrats, or because Democrats produce tweets that are somehow more appealing, or any number of other confounding factors (so we probably shouldn’t read too much into this finding without a more careful analysis). The convergence warnings in the output above are another reason to treat these estimates with caution.
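Because the model uses a log link, the coefficients are easier to read after exponentiating them, which turns each one into a multiplicative effect on the expected count of the outcome (favorite_count in the model above). The short sketch below copies a few estimates from the output by hand; it is an arithmetic illustration, not a re-estimation of the model.

```r
# Coefficients copied by hand from the model output above
coefs <- c(partyI = -1.509, partyR = -1.290, genderM = -0.08106)

# Exponentiate to get multiplicative effects (incidence rate ratios)
irr <- exp(coefs)
round(irr, 2)

# followers_count is measured per single follower, so rescale the
# coefficient to an increase of 100,000 followers
irr_100k_followers <- exp(1.568e-06 * 100000)
irr_100k_followers
```

For example, the partyR value works out to roughly 0.28, meaning that, holding the other predictors constant, the expected count for Republican senators is about 28 percent of that for Democrats. An additional 100,000 followers corresponds to a multiplier of roughly 1.17.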
There is one more skill that will be useful for you to have in order to work with Twitter data. Very often, we want to track trends over time or subset our data according to different time periods. If we browse the variable that describes the time each tweet was created, however, we see that it is not in a format that we can easily work with in R:
head(for_analysis$created_at)
## [1] "2018-09-17 01:43:46 UTC" "2018-09-14 17:15:06 UTC"
## [3] "2018-09-14 16:56:42 UTC" "2018-09-14 16:56:42 UTC"
## [5] "2018-09-13 16:55:18 UTC" "2018-09-12 18:11:21 UTC"
To manage these types of string variables that describe dates, it is often very useful to convert them into a variable of class “Date.” There are several ways to do this in R, but here is the way to do it using the as.Date function in base R.
for_analysis$date<-as.Date(for_analysis$created_at, format="%Y-%m-%d")
head(for_analysis$date)
## [1] "2018-09-17" "2018-09-14" "2018-09-14" "2018-09-14" "2018-09-13"
## [6] "2018-09-12"
Now, we can subset the data using conventional techniques. For example, if we wanted to only look at tweets for August, we could do this:
august_tweets<-for_analysis[for_analysis$date>"2018-07-31"&
for_analysis$date<"2018-09-01",]
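If the time of day matters as well, the same strings can be converted to date-times rather than dates. This is a minimal sketch in base R, assuming created_at is stored as text in the format shown above; the two values are copied from that output.

```r
# Two timestamps copied from the output above
created <- c("2018-09-17 01:43:46 UTC", "2018-09-14 17:15:06 UTC")

# Parse as date-times in UTC; the trailing " UTC" text is ignored by the format
timestamps <- as.POSIXct(created, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Year-month labels are convenient for subsetting or grouping by month
month_label <- format(timestamps, "%Y-%m")
month_label

# Hour of the day (0 to 23) at which each tweet was posted
hour_posted <- as.integer(format(timestamps, "%H"))
hour_posted
```

A month label makes a subset such as the August example a single comparison (month_label == "2018-08"), and the hour makes it possible to look at when during the day officials tend to tweet.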
By now it is hopefully clear that APIs are an invaluable resource for collecting data from the internet. At the same time, it may also be clear that the process of obtaining credentials, avoiding rate limiting, and understanding the unique jargon employed by those who create each API can mean a lot of hours sifting through the documentation of an API—particularly where there are not well functioning R packages for interfacing with the API in question. If you have to develop your own custom code to work with an API— or if you need information that is not obtainable using functions within an R package, you may find it useful to browse the source code of the R functions we have discussed above in order to see where they pass the API query language necessary to produce the results we worked with above.
There are numerous databases that describe popular APIs on the web, including the aforementioned Programmable Web and a variety of crowd-sourced and user-generated lists:
https://www.programmableweb.com/
https://github.com/toddmotto/public-apis
https://apilist.fun/
The R OpenSci site also has a list of R packages that work with APIs:
https://ropensci.org/packages/
Happy coding!