WikiLeaks Twitter DMs

About

On 29 July 2018, Emma Best published on her website a copy of 11,000+ WikiLeaks Twitter DMs: https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/

Here is an extraction and wrangling of this corpus, to make it easily searchable, extractable and shareable.

How to use this page

  • Every “link.csv” link is a downloadable csv.
  • You can search and order every table. Search results can be downloaded as csv or copied to the clipboard.
  • You can zoom in on the time series by selecting a date range, or choose the range with the selector beside the plot. Double-click to reset the view.
  • Under each dynamic plot, you can find a static version by clicking on “Static plot”.

This page may not work as expected on Internet Explorer / Edge. Please switch to another browser if you have trouble reading this page.

Data format

  • Every csv is encoded in UTF-8 (see the reading sketch below).
  • You can find these csv in JSON format on the GitHub repo.
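
For instance, here is a minimal sketch of reading the data back into R. The file names are illustrative (wikileaks_dm.csv is the global csv produced below; the JSON path assumes the repo’s json/ folder):

library(readr)
# Read a csv, stating the UTF-8 encoding explicitly
dm <- read_csv("wikileaks_dm.csv",
               locale = locale(encoding = "UTF-8"))
# Read the same data from its JSON version
dm_json <- jsonlite::read_json("json/wikileaks_dm.json",
                               simplifyVector = TRUE)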

Browse through the content

  • Home has the full dataset, to search and download.
  • Timeline has a series of time-related content: notably DMs by year, and a daily count of DMs.
  • Users holds the dataset for each user.
  • mentions_urls holds the extracted mentions and urls.
  • methodo contains the methodology used for the data wrangling.

About DMConversationEntry

As documented in the methodo, DMConversationEntry items have no date in the dataset; their date is therefore inferred from the directly preceding dated message, so the dates attached to these entries may not be exact.
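
As a minimal illustration (toy rows, not the real corpus), this is how tidyr::fill() carries the preceding date down onto the dateless entries:

library(tidyr)
toy <- tibble::tibble(
  user = c("user1", "DMConversationEntry", "user2"),
  date = c("2015-05-02 14:12:27", NA, "2015-05-02 14:15:02")
)
fill(toy, date)
# Row 2 inherits the date of row 1: "2015-05-02 14:12:27"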

Raw to csv

Downloading

This was done once, and the result saved in raw.txt.

library(tidyverse)
library(rvest)
doc <- read_html("https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/")
doc <- doc %>% 
  # Keep only the text of the <p> nodes
  html_nodes("p") %>%
  html_text()

# Drop empty lines
doc <- doc[nchar(doc) != 0]
# Elements 1 to 9 are the content of the blogpost, not the conversation. Removing them:
doc <- doc[10:length(doc)]
write(doc, "raw.txt")

Getting the data as a table

library(tidyverse)
res_raw <- read.delim("raw.txt", sep = "\n", header = FALSE) %>% 
  as_tibble() %>% 
  rename(value = V1)

To tidy the format, DMConversationEntry is added as the author when the text is a [DMConversationEntry], so that it matches the <user> pattern of the other lines:

res <- res_raw %>% 
  mutate(value = str_replace_all(value, 
                                 "\\[DMConversationEntry\\]",
                                 "<DMConversationEntry>")) 

The last line of the corpus is removed:

  • “[LatestTweetID] 931704226425856001”, line 22751 of the original document

res <- filter(res, ! str_detect(value, "931704226425856001"))

When a line n doesn’t start with a date (i.e. it is the middle of a DM), its content is pasted at the end of line n-1.

Example with lines 93 & 94:

value
“[2015-05-02 14:12:27] OK, thanks H. Security issues were about who was on the list then?”
“Never quite know who you’re dealing with online I guess. I don’t, anyway!”

Line 94 is pasted at the end of line 93, then removed.

for (i in nrow(res):1){
  # If line i has no date tag and is not a DMConversationEntry,
  # append it to the preceding line
  if (!grepl(pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry", res$value[i])){
    res$value[i - 1] <- paste(res$value[i - 1], res$value[i])
  }
}
# Remove rows with no date
res <- res %>% 
  mutate(has_date = str_detect(value, pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry")) %>%
  filter(has_date) %>%
  select(-has_date)

Extract key elements

res <- res %>%
  # Extract the <user> tag, the [date] stamp, and the message text
  extract(value, "user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
  extract(value, "date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
  extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE) %>%
  select(-value)

When the date is missing, it’s because the entry is a DMConversationEntry:

res %>% 
  filter(user == "DMConversationEntry") %>%
  summarize(nas = sum(is.na(date)), 
            nrow = n())
## # A tibble: 1 x 2
##     nas  nrow
##   <int> <int>
## 1    20    20

We fill these with the directly preceding date:

res <- fill(res, date)

Save

Global

write_csv(res, "wikileaks_dm.csv")

Year

range(res$date)
## [1] "2015-05-01 13:52:11" "2017-11-10 04:30:46"
walk(2015:2017, 
    ~ filter(res, lubridate::year(date) == .x) %>%
    write_csv(glue::glue("{.x}.csv"))
    )

User

walk(unique(res$user), 
    ~ filter(res, user == .x) %>%
    write_csv(glue::glue("user_{make.names(.x)}.csv"))
    )

Counting user participation

res %>%
  count(user, sort = TRUE) %>%
  write_csv("user_count.csv")

Counting activity by day

res %>%
  mutate(date = lubridate::ymd_hms(date), 
         date = lubridate::date(date)) %>% 
  count(date) %>%
  write_csv("daily.csv")

Adding extra info

mentions <- res %>% 
  # One row per extracted @mention
  mutate(mention = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>%
  unnest(mention) %>% 
  select(mention, everything())
write_csv(mentions, "mentions.csv")

mentions %>%
  count(mention, sort = TRUE) %>%
  write_csv("mentions_count.csv")

urls <- res %>% 
  mutate(url = str_extract_all(text, "http.+")) %>%
  unnest(url) %>% 
  select(url, everything())
write_csv(urls, "urls.csv")

Adding JSON format

list.files(pattern = "csv") %>%
  walk(function(x) {
    o <- read_csv(x)
    jsonlite::write_json(
      o,
      path = glue::glue("{tools::file_path_sans_ext(x)}.json")
    )
  })
dir.create("json")
list.files(pattern = "json") %>%
  walk(function(x){
    file.copy(x, glue::glue("json/{x}"))
    unlink(x)
  })
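
As an optional sanity check (a sketch, assuming the file names used above), a JSON file can be read back and compared with its csv counterpart:

csv_version  <- readr::read_csv("wikileaks_dm.csv")
json_version <- jsonlite::read_json("json/wikileaks_dm.json",
                                    simplifyVector = TRUE)
# Both versions should have the same number of rows
nrow(csv_version) == nrow(json_version)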