Analyzing Russian Trolls in R

In January 2017, the U.S. Congress published a list of Twitter accounts that, according to Twitter itself, are connected to the Russian Internet Research Agency (IRA, aka the “Russian Troll Factory”). The American Intelligence Community believes that the IRA has close ties to Russian intelligence services. We do not know the methods Twitter used to identify the so-called “trolls”, so we cannot be entirely sure that the accounts are controlled by employees of the Russian “Troll Factory”. However, the list is highly intriguing, considering that Twitter is putting its reputation on the line by openly admitting that foreign governments are gaming the platform to influence the political situation in the U.S. The U.S. District Court went as far as indicting Russian nationals for using the IRA online campaign to interfere with the U.S. presidential elections.

Ironically, Twitter does not allow others to store the data about the published “troll” accounts, because they have now been removed from the platform. Nevertheless, NBC News has published a data set consisting of nearly 200k tweets from some of these accounts. This post briefly explores the public data from NBC News using social network analysis in R and visualization in Gephi.

Getting Started in R

The first thing we need to do is load the necessary packages in R.

#loading packages
library("dplyr") #for data manipulation

## Warning: package 'dplyr' was built under R version 3.4.4

library("igraph") # for social network analysis

## Warning: package 'igraph' was built under R version 3.4.4

Then we load the data directly from the NBC News site.

tweets <- read.csv("http://nodeassets.nbcnews.com/russian-twitter-trolls/tweets.csv",
                   stringsAsFactors = FALSE, sep = ",")

The next step is to extract relational information from the data set that will allow us to construct a network of retweets. For this, we need to see 1) who is retweeting (senders) and 2) who is being retweeted (receivers).

# selecting only the retweets
rts <- grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl = TRUE, value = TRUE)
# extracting handle names for the senders (those who retweet)
rt.sender <- tolower(as.character(tweets$user_key[grep("^rt @[a-z0-9_]{1,15}", tolower(tweets$text), perl = TRUE)]))
# extracting handle names for the receivers (those who are being retweeted)
rt.receiver <- tolower(regmatches(rts, regexpr("@(?U).*:", rts)))
rt.receiver <- gsub(":", "", rt.receiver) # removing ":"
rt.receiver <- gsub("@", "", rt.receiver) # removing "@"
# registering empty entries as missing
rt.sender[rt.sender == ""] <- "<NA>"
rt.receiver[rt.receiver == ""] <- "<NA>"
# a large proportion of this code is from <https://www.r-bloggers.com/generating-graphs-of-retweets-and-messages-on-twitter-using-r-and-gephi/>
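To see what the extraction step does, here is a minimal sketch on three invented tweets (the `toy` data frame and its handles are made up for illustration and are not part of the NBC data). As a small variation on the code above, it pulls the handle out directly with a character-class pattern instead of matching everything up to the colon.

```r
# Toy tweets, invented for illustration (not from the NBC data)
toy <- data.frame(
  user_key = c("Alice", "bob", "carol"),
  text = c("RT @Bob: hello world",
           "just an ordinary tweet",
           "rt @alice: good morning"),
  stringsAsFactors = FALSE
)

# a tweet counts as a retweet if it starts with "rt @handle"
is.rt <- grepl("^rt @[a-z0-9_]{1,15}", tolower(toy$text), perl = TRUE)

# senders: the accounts that posted the retweets
toy.sender <- tolower(toy$user_key[is.rt])

# receivers: the first @handle in each retweet, with the "@" stripped
toy.rts <- tolower(toy$text[is.rt])
toy.receiver <- regmatches(toy.rts, regexpr("@[a-z0-9_]{1,15}", toy.rts))
toy.receiver <- sub("@", "", toy.receiver)

# one row per retweet: who retweeted whom
data.frame(sender = toy.sender, receiver = toy.receiver)
```

Because every retweet selected by `is.rt` is guaranteed to contain an `@handle`, the sender and receiver vectors always have the same length in this variant.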

Now that we can see who retweets whom, we also want to know whether the profiles can be labeled as “trolls”. We can do this by 1) making a full list of handle names in the retweet network and 2) checking whether the users match the identified “troll” accounts.

The first step is simple.

# storing receiver and sender handle names in one data frame and removing duplicates
handle.all <- data.frame(handle = unique(c(rt.sender, rt.receiver)), stringsAsFactors = FALSE)

In the next step, we will combine the full list of “troll” handle names released by Congress with the “troll” handle names in the NBC News data. This will be our “troll” list. We then label all of the accounts in the retweet network as “trolls” if their handle names appear in the list. The code below is chunky, but it does the job one step at a time.

# importing handle names from the official list released by Congress
trolls_official <- read.csv("http://golovchenko.github.io/data/trollhandles.txt", stringsAsFactors = FALSE)
# merging the complete list of official troll handle names with the ones in the NBC data
tweets <- tweets %>% rename(handle = user_key) # renaming the handle name variable
handles <- tweets %>% select(handle) # selecting only the handles from the data
handles <- rbind(trolls_official, handles)
handles.u <- unique(handles) # removing duplicates
handles.u$troll <- "troll" # labeling all of these users as trolls
### matching trolls with the complete set of handle names in the retweet network
nodes <- right_join(handles.u, handle.all, by = "handle")
nodes <- replace(nodes, is.na(nodes), "non-troll") # now we have a variable indicating whether a user is a troll
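The labeling logic can also be expressed in base R with `%in%`. The sketch below is an alternative to the join used above, not the method of this post, and runs on invented handles (`known.trolls` and `network.handles` are hypothetical names).

```r
# Hypothetical handles, invented for illustration
known.trolls <- c("troll_a", "troll_b")
network.handles <- c("troll_a", "newsoutlet", "troll_b", "politician")

# label each handle in the network by checking membership in the troll list
toy.nodes <- data.frame(
  handle = network.handles,
  troll = ifelse(network.handles %in% known.trolls, "troll", "non-troll"),
  stringsAsFactors = FALSE
)
toy.nodes
```

The result has the same two columns (`handle`, `troll`) as the `nodes` data frame, so it could serve as a vertex table in the same way.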

Now it is time to combine the sender and receiver information into a single data frame with two columns or, in this case, an “edge list”. The edge list allows us to see who retweets whom and how many times.

### Creating a data frame from the sender-receiver objects
rts.df <- data.frame(rt.sender, rt.receiver)

We then use the edge list to create a network object in igraph, which is essentially our retweet network.

### creating the retweet network based on the sender-receiver df and the node attributes (troll/non-troll)
rts.g <- graph_from_data_frame(rts.df, directed = TRUE, vertices = nodes)
### removing self-ties
rts.g <- simplify(rts.g, remove.loops = TRUE, remove.multiple = FALSE)
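As a quick sanity check of this step, the following sketch builds a tiny network from an invented edge list and vertex table (all handles are hypothetical). It assumes the igraph package is installed and shows that the self-retweet is dropped while the repeated retweet is kept.

```r
library(igraph)

# Invented edge list: troll_a retweets outlet twice, troll_b retweets itself once
toy.edges <- data.frame(
  sender   = c("troll_a", "troll_a", "troll_b", "troll_b"),
  receiver = c("outlet",  "outlet",  "troll_b", "troll_a"),
  stringsAsFactors = FALSE
)
# Invented vertex table: the first column is used as the vertex name
toy.nodes <- data.frame(
  handle = c("troll_a", "troll_b", "outlet"),
  troll  = c("troll", "troll", "non-troll"),
  stringsAsFactors = FALSE
)

toy.g <- graph_from_data_frame(toy.edges, directed = TRUE, vertices = toy.nodes)
# drop self-ties but keep multiple edges between the same pair
toy.g <- simplify(toy.g, remove.loops = TRUE, remove.multiple = FALSE)

ecount(toy.g)   # number of remaining edges
V(toy.g)$troll  # vertex attribute carried over from toy.nodes
```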

Now we can compute basic centrality scores for each user and store them in a data frame.

# removing multiple edges between users
g <- simplify(rts.g, remove.multiple = T, remove.loops = T)
# creating a data frame with weighted and unweighted degree centrality for each profile
df <- data.frame(name =V(g)$name,
                 troll= V(g)$troll,indegree=degree(g,mode='in'),
                 indegree_weighted = degree(rts.g, mode ="in"),
                 outdegree=degree(g,mode='out'),
                 outdegree_weighted = degree(rts.g, mode = "out"))
#ranking users by indegree
rank.indegree <- df %>% select(name, troll, indegree,
                          indegree_weighted) %>% arrange(-indegree)

## Warning: package 'bindrcpp' was built under R version 3.4.4

# ranking users by weighted indegree (total number of retweets received)
rank.indegree.w <- df %>% select(name, troll, indegree,
                          indegree_weighted) %>% arrange(-indegree_weighted)
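To make the difference between the two indegree measures concrete, here is a minimal base-R sketch on an invented edge list (the handles are hypothetical). The unweighted indegree counts distinct senders per receiver, while the weighted indegree counts every retweet.

```r
# Invented edge list: one row per retweet
toy.edges <- data.frame(
  sender   = c("troll_a", "troll_a", "troll_a", "troll_b", "troll_c"),
  receiver = c("outlet",  "outlet",  "pundit",  "outlet",  "pundit"),
  stringsAsFactors = FALSE
)

# weighted indegree: total number of retweets each receiver gets
indegree.weighted <- table(toy.edges$receiver)

# unweighted indegree: number of distinct senders per receiver,
# obtained by dropping repeated sender-receiver pairs first
indegree <- table(unique(toy.edges)$receiver)

indegree.weighted
indegree
```

Here `outlet` is retweeted three times by two distinct accounts, which mirrors the gap between the `indegree` and `indegree_weighted` columns in the data frame above.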

The table below shows the top 10 profiles ranked by indegree. Note that the data only includes ‘ego-centric’ “troll” networks. In other words, the data only shows us whom the “trolls” retweet and not vice versa. The data suggests that the “trolls” actively retweet both left-leaning and right-leaning profiles. The Hill, a slightly left-of-center news outlet, has the highest indegree of 102, which means that it has been retweeted by 102 IRA accounts. When examining the weighted indegree, i.e. the number of times a profile has been retweeted by “trolls”, we see that these 102 accounts have retweeted The Hill 358 times in total. Note also that the top 10 list includes profiles belonging to Fox News, Hillary Clinton, and Donald Trump.

library(knitr)

## Warning: package 'knitr' was built under R version 3.4.4

kable(rank.indegree[1:10,], caption = "Top 10 profiles ranked by indegree")

Top 10 profiles ranked by indegree
name	troll	indegree	indegree_weighted
thehill	non-troll	102	358
realdonaldtrump	non-troll	100	544
wikileaks	non-troll	82	247
blicqer	non-troll	69	2207
hillaryclinton	non-troll	61	98
joyannreid	non-troll	58	267
prisonplanet	non-troll	56	462
jamilsmith	non-troll	55	118
ten_gop	troll	53	430
foxnews	non-troll	53	336

The next table shows the top 10 users ranked by weighted indegree. The highest-ranking profile, blicqer, has been retweeted 2207 times by 69 trolls.

kable(rank.indegree.w[1:10,], caption = "Top 10 profiles ranked by weighted indegree")

Top 10 profiles ranked by weighted indegree
name	troll	indegree	indegree_weighted
blicqer	non-troll	69	2207
conservatexian	non-troll	30	1082
realdonaldtrump	non-troll	100	544
nine_oh	non-troll	11	500
prisonplanet	non-troll	56	462
zaibatsunews	non-troll	16	451
gerfingerpoken	non-troll	46	434
ten_gop	troll	53	430
bizpacreview	non-troll	17	401
beforeitsnews	non-troll	8	399

Blicqer is a user who writes about political issues in the U.S. from an African-American perspective in a relatively left-leaning manner. Alongside this profile, we find conservatexian, who described himself on Twitter as an “Overeducated conservative cowboy”. This pattern reemerges in other data as well. In 2017, Facebook released a set of ads which, according to the company's own report, it had sold to the IRA. The two images below are examples of these ads. Here too, the IRA targeted the far right, the far left, and supporters of African-American activism.

Ads bought by the IRA

The next step is to visualize the network in R, including only the trolls. We see that 1) many of the “trolls” do retweet each other and 2) most of those that do are part of one large connected component.

### subsetting the graph by removing non-trolls
# selecting nodes to exclude
exclude <- V(rts.g)[troll == "non-troll"]
# excluding the nodes
g.troll <- delete_vertices(rts.g, exclude)

### visualizing the graph
par(bg = "grey10")
plot.igraph(g.troll, layout = layout_with_fr(g.troll),
            edge.color = "grey",
            edge.curved = .2, vertex.label = NA, vertex.frame.color = "#ffffff",
            vertex.size = 2, edge.width = 0.01, edge.arrow.size = 0.01)

When zooming in on the largest component, it becomes clearer that the “trolls” are positioned in different clusters. A “troll” may interact frequently with one group of “trolls”, while having little or no interaction with other groups.

# finding the weakly connected components and keeping the largest one
comps <- components(g.troll, mode = "weak")
g.largest <- induced_subgraph(g.troll, which(comps$membership == which.max(comps$csize)))
### plotting the graph
par(bg = "grey10")
plot.igraph(g.largest, layout = layout_with_fr(g.largest),
            edge.color = "grey",
            edge.curved = .2, vertex.label = NA, vertex.frame.color = "#ffffff",
            vertex.size = 4, edge.width = 0.005, edge.arrow.size = 0.01)
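One way to be sure the component being plotted really is the largest one is to select it by size. The sketch below does this on an invented toy graph with igraph's `components()` and `induced_subgraph()`; the vertex names are hypothetical, and the first vertex is deliberately placed in the smaller component.

```r
library(igraph)

# Toy directed graph with two weak components: {a, b} and {c, d, e}
toy.g <- graph_from_data_frame(
  data.frame(from = c("a", "c", "d", "e"),
             to   = c("b", "d", "e", "c"),
             stringsAsFactors = FALSE),
  directed = TRUE
)

# membership gives the component id of each vertex, csize the component sizes
comps <- components(toy.g, mode = "weak")
biggest <- which.max(comps$csize)

# keep only the vertices that belong to the biggest component
toy.largest <- induced_subgraph(toy.g, which(comps$membership == biggest))
sort(V(toy.largest)$name)
```

Selecting by `csize` does not depend on the order in which components happen to be discovered, which matters when the first vertex of the graph sits in a small component.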

Analyzing the Retweet Network in Gephi

In our next and final step, we will explore the political diversity in the NBC News troll data by visually analyzing the full network of retweets (including both “trolls” and “non-trolls”). R is a useful tool for restructuring the data and for computing network metrics. However, visualizing large graphs in R can be both slow and frustrating. For this reason, we will visualize the retweet network in Gephi, an open-source program that is free to download. The network in R can be exported in a Gephi-friendly graphml format using one line of code:

# exporting the rts.g graph object as a graphml file
write_graph(rts.g, file = "troll_network.graphml", format = "graphml")
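If graphml is not convenient, a plain CSV edge list is another option, since Gephi's spreadsheet importer reads edge tables with `Source`, `Target` and `Weight` columns. The following base-R sketch is an alternative to the export above, run on invented data; the file name `toy_edges.csv` and the handles are hypothetical.

```r
# Invented edge list: one row per retweet
toy.edges <- data.frame(
  Source = c("troll_a", "troll_a", "troll_b"),
  Target = c("outlet",  "outlet",  "outlet"),
  stringsAsFactors = FALSE
)

# collapse repeated retweets into one row per pair, counting them as Weight
toy.edges$Weight <- 1
toy.weighted <- aggregate(Weight ~ Source + Target, data = toy.edges, FUN = sum)

# write the weighted edge list to a temporary CSV file
out.file <- file.path(tempdir(), "toy_edges.csv")
write.csv(toy.weighted, out.file, row.names = FALSE)
toy.weighted
```

Unlike the graphml export, this route does not carry the troll/non-troll node attribute, so that label would need to be imported separately as a node table.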

After importing the graphml file into Gephi and running the ForceAtlas 2 layout algorithm, we may get a network that looks like the one below. Nodes represent users, and an edge between two nodes is established if one user has retweeted the other. Node and label size reflect indegree centrality, whereas color reflects the network “communities” identified by Gephi’s built-in community detection algorithm, also known as the Louvain modularity method. Here we see three major clusters.

Retweet Network

When zooming in on the cluster below, we see that the most central users here are primarily right-wing (e.g. Trump and Breitbart News). Many of these accounts are popular among the more extreme libertarians or conspiracy theorists (WikiLeaks, RT, Alex Jones and Paul Joseph Watson from InfoWars). These accounts often portray themselves as alternatives to mainstream, “globalist” or center-left media.

Retweet Network: Zooming in (Part 1)

We see the opposite pattern when zooming in on users in the cluster to the left. Here, we find primarily left-leaning users, African-American activists or mainstream media outlets that are contested by the central nodes in the cluster to the right.

Retweet Network: Zooming in (Part 2)

The third-largest cluster consists of lesser-known and more ambiguous profiles, which is why it should be explored in a separate post of its own. A close analysis of the two largest clusters alone suggests a possible division of labor between the trolls. Some profiles specialize in retweeting right-leaning content, whereas others amplify views from the political left. This simple visual analysis is not sufficient in itself, since a lot more can be learned by digging deeper into the data. Hopefully this brief data post will give you a starting point for future analysis in R.


Yevgeniy Golovchenko

25 April 2018
