Chris Bail
Duke University
www.chrisbail.net

Introduction

This is the last in a series of tutorials designed to introduce quantitative text analysis in R. This tutorial focuses upon one of the newest methods in this field, called Text Networks.

What is a Text Network?

Network analysis refers to a family of methods that describe relationships between units of analysis. A network comprises nodes as well as the edges, or connections, between them. In a social network, such as the one in the figure below, nodes are often individual people, and edges describe friendships, affiliations, or other types of social relationships. A rich theoretical tradition in the social sciences describes how patterns of clustering within social networks, and an individual’s position within or between clusters, are associated with a remarkably wide range of outcomes, including health, employment, and education, among many others.

Though network analysis is most often used to describe relationships between people, some of the early pioneers of network analysis realized that it could also be applied to represent relationships between words. For example, one can represent a corpus of documents as a network where each node is a document, and the thickness or strength of the edges between documents describes similarities between the words they use. Or, one can create a text network where individual words are the nodes, and the edges between them describe the regularity with which they co-occur in documents.

There are multiple advantages to a network-based approach to automated text analysis. First, just as clusters of social connections can help explain a range of outcomes, understanding patterns of connections between words helps identify their meaning more precisely than the “bag of words” approaches discussed in earlier tutorials. Second, text networks can be built out of documents of any length, whereas topic models function poorly on short texts such as social media messages. Finally, there is an arguably more sophisticated set of techniques for identifying clusters within social networks than those employed in the other automated text analysis techniques described in my earlier tutorials.

Two-mode networks

Before we move on to a working example, we will need to delve a little bit deeper into some terminology from network analysis—specifically, the concept of two-mode networks. To clarify this concept, let’s take the example of a network where the first node set is words found in US newspaper headlines on the day of the first moon landing (July 20, 1969), and the second node set is the newspapers themselves. The data would look something like this:

Here is a visualization of this two-mode network. As you can see, edges are drawn only between newspapers and words (i.e. between nodes belonging to different sets).

With some reshaping of the data, this two-mode network can be projected in either of its one-mode forms. That is, one can either create a network where newspapers are connected by their use of the same words, OR, words in all of the articles can be connected based upon their co-appearance in newspapers.
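
To make this concrete, here is a minimal sketch of a two-mode network and its two one-mode projections using igraph. The newspapers, words, and incidence matrix below are invented for illustration; they are not the moon-landing data described above.

library(igraph)

# A toy incidence matrix: rows are newspapers, columns are headline words
# (both node sets are hypothetical)
incidence <- matrix(c(1, 1, 0,
                      1, 0, 1,
                      0, 1, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Paper A", "Paper B", "Paper C"),
                                    c("moon", "landing", "astronauts")))

# Build the two-mode (bipartite) network: edges run only between
# newspapers and words
g <- graph_from_incidence_matrix(incidence)

# Project into both one-mode forms: newspapers linked by shared words,
# and words linked by co-appearance in newspapers
proj <- bipartite_projection(g)
proj$proj1  # newspaper-by-newspaper network
proj$proj2  # word-by-word network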

In text networks, one node set will always consist of the words found in the documents analyzed; the other node set can be the documents themselves (as above), or some other type of metadata about those documents (such as the author’s name or the date when the document was published or created).

The Textnets Package

The only R package presently available to implement text network techniques is the textnets package. The most recent version of textnets is available on Github. To install textnets, or any other package hosted on Github, you will need the devtools package:

library(devtools)
install_github("cbail/textnets")

To use textnets you’ll also need a number of other packages for network analysis, sentence parsing, text analysis, and visualization (install any you do not already have). I will describe in broad terms below how each of these tools is used, but we will not be working through the functions in these other packages; instead, textnets borrows functions from them to create text networks.

library(textnets)
library(dplyr)
library(Matrix)
library(tidytext)
library(stringr)
library(SnowballC)
library(reshape2)
library(phrasemachine)
library(igraph)
library(ggraph)
library(networkD3)

An overview of textnets

The textnets package provides functions for each of the following steps:

  1. preparing texts for network analysis
  2. creating text networks
  3. detecting themes or “topics” within text networks
  4. visualizing text networks

We will work through each of these steps one-by-one with a working example in the following sections.

Preparing Texts

The textnets package requires text that is contained within a dataframe, where each row represents a document. The text of each document must be contained within a single column, but the dataframe can also include other columns that describe metadata, such as the author’s name or date of publication.

Example: State of the Union Addresses

To get a better sense of this, let’s take a look at some sample data. We are going to create a text network using texts from the State of the Union Addresses by U.S. presidents. Each row in the dataframe we load below describes an address given by a president, and the year in which that address was made. The dataset also describes the president’s party affiliation.

The dataset is available via the sotu package. The following code binds the sotu_text and sotu_meta objects together to make a single dataframe, as required by textnets. The final line ensures that the text column, sotu$sotu_text, is a character vector.

library(sotu)
sotu <- data.frame(cbind(sotu_text, sotu_meta), stringsAsFactors=FALSE)
sotu$sotu_text <- as.character(sotu$sotu_text)
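
As a quick sanity check, you can confirm that the result has one row per address and that the text column is the expected type (the metadata column names come from the sotu package):

nrow(sotu)             # one row per address
names(sotu)            # text column plus metadata such as president, year, and party
class(sotu$sotu_text)  # should be "character"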

The textnets package includes two functions to prepare texts for analysis; you will choose one or the other for your analysis. The PrepText function prepares texts for networks using all types of words, while the PrepTextNounPhrases function prepares texts for networks using only nouns and noun phrases. Users may prefer to create networks based on only nouns or noun phrases because previous studies have shown that such parts of speech are more useful in mapping the topical content of a text than other parts of speech, such as verbs or adjectives (e.g. Rule, Cointet, and Bearman 2015).

PrepText

Let’s begin with the PrepText function. This function requires the user to provide four inputs:

  1. a dataframe that meets the requirements described above;
  2. the name of the column within that dataframe containing the texts the user would like to analyze, in character format (specified via the textvar argument);
  3. the name of the column describing the groups through which the words of those texts will be linked (specified via the groupvar argument). The groupvar is often some type of document identifier or the name of the document’s author (in this case a president), but it could be any other variable of interest as well (such as time). In network analysis terminology, the textvar and the groupvar specify the node sets of a two-mode network;
  4. which projection of the two-mode network should be created, specified via the node_type argument. To build a network where the nodes are words, specify node_type="words"; to build a network where the nodes are the authors of documents (or any other metadata), use node_type="groups".

The PrepText function also includes four optional arguments. The remove_url argument eliminates any hyperlinks within the provided texts. The remove_stop_words argument eliminates very common English-language words such as “and”, “the”, or “at”. The stem argument reduces each term to its stem form; for example, the term “running” becomes “run” if the user sets stem=TRUE. The remove_numbers argument, if set to TRUE, will remove numbers that are unattached to letters (i.e. it will remove “60” and “1960” but not “60s” or “1960s”; likewise “2” but not “2nd”, etc.).
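
For example, a call using all four optional arguments might look like the following. This is a sketch based only on the argument names described above, and it builds the other projection of the network, where nodes are words rather than presidents:

sotu_words <- PrepText(sotu, textvar="sotu_text", groupvar="president",
                       node_type="words", remove_url=TRUE,
                       remove_stop_words=TRUE, stem=TRUE, remove_numbers=TRUE)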

The output of the PrepText function is a dataframe in “tidytext” style, where each row of the dataframe describes a word, the document that it appears in, and its overall frequency within that document. If you need a refresher on tidytext format, see my earlier tutorial on Basic Text Analysis.

The following code prepares the State of the Union data for text network analysis, specifying that nodes will be presidents, and edges thus describe overlap of words used in their speeches. In this example we also remove stop words and stem.

sotu_text_data <- PrepText(sotu, textvar="sotu_text", groupvar="president", node_type="groups", remove_stop_words=TRUE, stem=TRUE)
save(sotu_text_data, file = "sotu_text_data.Rdata")
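
To verify the structure of the tidy output, inspect the first few rows (one row per president-word pair, with a count):

head(sotu_text_data)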

PrepTextNounPhrases

textnets also enables one to use sentence parsing in order to build text networks where the nodes are noun phrases. This can be advantageous since noun phrases often contain more of the content of interest than other parts of speech such as adjectives or verbs.

The syntax for using the PrepTextNounPhrases function is the same as for the PrepText function. Nouns are identified using the phrasemachine package, which requires a version of Java greater than 7 (alternatively, with a bit of work, you can set it up to use the spaCy parser with a Python backend). Either way, the PrepTextNounPhrases function will take much longer to run than the PrepText function, because it must perform part-of-speech tagging on each sentence within each document in the provided dataframe. The amount of time required will depend on both the number and the length of the texts.

The PrepTextNounPhrases function has an additional optional argument that is not in the PrepText function, which describes the length of phrases that should be used. The default n-gram length for noun phrases is 4, but the user may specify a different maximum using the max_ngram_length argument. This may be of particular importance to those interested in organizations such as universities or government agencies, as many are likely to have formal titles comprised of more than 4 words. Additionally, while all nouns are included in the output, only the top 1,000 noun phrases are included, to prevent over-detection of nested terms (e.g. ‘the_President’, ‘the_President_of_the_United’, ‘the_President_of_the_United_States’, ‘the_President_of_the_United_States_of_America’). If the user wishes to extract all noun phrases instead of just the top 1,000, the top_phrases argument should be set to FALSE.

# Deprecated function
# sotu_text_data_nouns <- PrepTextNounPhrases(sotu, "president", "sotu_text", node_type="groups", top_phrases=TRUE)

Creating Text Networks

The workhorse function within the textnets package is the CreateTextnet function. This function reads in an object created using the PrepText or PrepTextNounPhrases functions and outputs a weighted adjacency matrix: a square matrix where the rows and columns correspond to either 1) the groups defined by the groupvar argument (if the user specified node_type="groups" in the previous stage), or 2) words (if the user specified node_type="words").

The number in each cell of the adjacency matrix is the sum of the term frequency-inverse document frequency (TF-IDF) scores for the terms that two documents share. This gives greater weight to words that appear in fewer documents and less weight to words that are common across the corpus.

sotu_text_network <- CreateTextnet(sotu_text_data)
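
The TF-IDF weighting happens inside CreateTextnet, but the idea can be illustrated with a standalone sketch using tidytext’s bind_tf_idf function. The column names word, group, and count are assumptions about the tidy output described earlier and may differ in practice:

library(dplyr)
library(tidytext)

# Illustration only: compute TF-IDF by hand on the tidy word counts,
# assuming columns named word (term), group (document), and count
sotu_tfidf <- sotu_text_data %>%
  bind_tf_idf(word, group, count) %>%
  arrange(desc(tf_idf))
head(sotu_tfidf)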

Analyzing Text Networks

In order to group documents according to their similarity, or to identify latent themes across texts, users may wish to cluster documents or words within text networks. The TextCommunities function applies the Louvain community detection algorithm to do this; it automatically uses the edge weights and determines the number of clusters within a given network. The function outputs a dataframe with the cluster, or “modularity” class, to which each document or word has been assigned.

sotu_communities <- TextCommunities(sotu_text_network)
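
You can inspect the output directly to see which cluster each president was assigned to and how large each cluster is (the modularity_class column name is an assumption about the output):

head(sotu_communities)
table(sotu_communities$modularity_class)  # cluster sizes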

In order to further understand which terms are driving the clustering of documents or words, the user can use the InterpretText function, which also reads in an object created by the CreateTextnet function and outputs the ten terms with the highest TF-IDF scores within each cluster or modularity class. In order to match words to clusters, the function requires the user to specify the name of the text dataframe used to create the text network, in this case sotu_text_data (see above).

top_words_modularity_classes <- InterpretText(sotu_text_network, sotu_text_data)
## Joining, by = "group"
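
Browsing the first few rows of the result shows the most distinctive terms in each cluster:

head(top_words_modularity_classes)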

Centrality Measures

Often in social networks, researchers wish to calculate measures of influence or centrality in order to predict whether occupying brokerage positions creates greater social rewards for individuals. As Bail (2016) shows, the same logic can be applied to text networks to develop a measure of “cultural betweenness,” or the extent to which a given document or word sits between clusters. To calculate cultural betweenness, as well as other centrality measures, textnets users can use the TextCentrality function.

text_centrality <- TextCentrality(sotu_text_network)
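
To surface the most “culturally between” presidents, sort the output by its betweenness scores. The betweenness column name here is an assumption about what TextCentrality returns:

# Rank nodes by betweenness centrality; nodes with high betweenness
# bridge clusters in the text network
head(text_centrality[order(-text_centrality$betweenness), ], 10)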

Visualizing Text Networks

Finally, the textnets package includes two functions to visualize text networks created in the previous steps. The VisTextNet function creates a network diagram where nodes are colored by their cluster or modularity class (see previous section). In many cases, text networks will be very dense (that is, they will have a very large number of edges, because most documents share at least one word). Visualizing text networks therefore creates inherent challenges, because such dense networks are very cluttered. To make text networks more readable, VisTextNet requires the user to specify a prune_cut argument, which determines which quantile of edges should be kept for the visualization. For example, if the user sets prune_cut=.9, only edges with a weight in the 90th percentile or above will be kept.

The VisTextNet function also includes an argument that determines which nodes will be labeled, since network visualizations with too many node labels can be difficult to interpret. This argument, label_degree_cut, specifies the minimum degree, or number of connections, that a node must have in order to be labeled. For example, if the user only wants to label nodes that have at least 3 connections to other nodes (and only wants to visualize edges with a weight above the 50th percentile), she or he would use the following code:

VisTextNet(sotu_text_network, prune_cut=.50, label_degree_cut=3)

The final function in the textnets package is the VisTextNetD3 function. This function outputs an interactive JavaScript visualization of the text network, in which the user can mouse over each node to reveal its label. Once again, nodes are colored by their modularity class, and the user must specify a prune_cut argument:

VisTextNetD3(sotu_text_network, prune_cut=.50)

To save this as an html file for sharing with others or in a presentation, the following can be used. The height and width parameters are set in pixels, and bound=TRUE will prevent the network from dispersing beyond these dimensions. While this may help viewers to see all nodes, it will also cause nodes to cluster at the limits of the height and width. This can be prevented by increasing the charge parameter, which specifies the strength of node repulsion (negative values) or attraction (positive values). The zoom parameter indicates whether to allow users to zoom in and out of the network, which can be especially helpful in large networks for exploring clusters.

library(htmlwidgets)
vis <- VisTextNetD3(sotu_text_network, 
                      height=1000,
                      width=1400,
                      bound=FALSE,
                      zoom=TRUE,
                      charge=-30)
saveWidget(vis, "sotu_textnet.html")

References

Bail, Christopher A. 2016. “Combining Network Analysis and Natural Language Processing to Examine How Advocacy Organizations Stimulate Conversation on Social Media.” Proceedings of the National Academy of Sciences 113(42): 11823-11828.

Rule, Alix, Jean-Philippe Cointet, and Peter Bearman. 2015. “Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790-2014.” Proceedings of the National Academy of Sciences 112(35): 10837-10844.