On 29 July 2018, Emma Best published on her website a copy of more than 11,000 WikiLeaks Twitter DMs: https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/
This page documents the extraction and wrangling of that corpus, to make it easy to search, extract and share.
This page may not work as expected on Internet Explorer / Edge. Please switch to another browser if you have trouble reading this page.
As documented in the methodology, the DMConversationEntry lines have no date in the dataset. Their date is therefore inferred from the directly preceding dated entry, so the dates of these entries may not be accurate.
The scraping step below was run once, and its result saved in raw.txt.
library(tidyverse)
library(rvest)
doc <- read_html("https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/")
doc <- doc %>%
# Get the text of every <p> element
html_nodes("p") %>%
html_text()
doc <- doc[! nchar(doc) == 0]
# Elements 1 to 9 are the content of the blog post, not of the conversation. Remove them:
doc <- doc[10:length(doc)]
write(doc, "raw.txt")
library(tidyverse)
res_raw <- read.delim("raw.txt", sep = "\n", header = FALSE) %>%
as_tibble() %>%
rename(value = V1)
To tidy the format, [DMConversationEntry] is rewritten as <DMConversationEntry>, so that these entries can later be parsed like a user name:
res <- res_raw %>%
mutate(value = str_replace_all(value,
"\\[DMConversationEntry\\]",
"<DMConversationEntry>"))
The last entry of the corpus is removed:
res <- filter(res, ! str_detect(value, "931704226425856001"))
When a line n doesn’t start with a date (middle of a DM), paste its content at the end of line n-1.
Example with lines 93 & 94:
value |
---|
“[2015-05-02 14:12:27] |
“Never quite know who you’re dealing with online I guess. I don’t, anyway!” |
Line 94 is pasted at the end of line 93, then removed.
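# Walk backwards so that a message split over several lines is rebuilt from the bottom up.
# Stop at 2: the first line has no preceding line to be pasted onto.
for (i in nrow(res):2){
if (!grepl(pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry", res$value[i])){
res$value[i-1] <- paste(res$value[i-1], res$value[i])
}
}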
# Remove rows with no date (their content has been pasted onto the preceding row)
res <- res %>%
mutate(has_date = str_detect(value, pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry")) %>%
filter(has_date) %>%
select(-has_date)
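The merge-then-filter step above can be tried on a small vector. This is only a sketch: the three input lines and the user names are made up for illustration and are not taken from the corpus; the pattern is the one used above.

```r
# Hypothetical input lines (not from the corpus): the second line is the
# continuation of the first message.
lines <- c(
  "[2015-05-02 14:12:27] <alice> first part",
  "second part of the same message",
  "[2015-05-02 14:13:00] <bob> a complete message"
)
stamp <- "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry"

# Walk backwards and paste every undated line onto the line before it
for (i in length(lines):2) {
  if (!grepl(stamp, lines[i])) {
    lines[i - 1] <- paste(lines[i - 1], lines[i])
  }
}

# Keep only the lines that carry a date or a DMConversationEntry marker
merged <- lines[grepl(stamp, lines)]
merged
```

Going from the last line to the first means that a message spread over three or more lines is still collected into its dated first line.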
res <- res %>%
extract(value,"user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
extract(value,"date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE) %>%
select(-value)
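To see what the three regular expressions above capture, here is a minimal sketch on two made-up lines (the user name and message text are hypothetical). It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical lines: one dated message, one DMConversationEntry without date
toy <- tibble(value = c(
  "[2015-05-02 14:12:27] <alice> hello there",
  "<DMConversationEntry> joined the conversation"
))

parsed <- toy %>%
  extract(value, "user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
  extract(value, "date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
  extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE) %>%
  select(-value)

parsed
```

When a pattern does not match, `extract()` leaves NA in the new column, which is why the undated second line ends up with a missing date.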
When the date is missing, it’s because the row is a DMConversationEntry.
res %>%
filter(user == "DMConversationEntry") %>%
summarize(nas = sum(is.na(date)),
nrow = n())
## # A tibble: 1 x 2
## nas nrow
## <int> <int>
## 1 20 20
We fill these missing dates with the directly preceding date:
res <- fill(res, date)
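As a small illustration of what this fill does, and why the resulting dates are only approximate, here is a sketch on a made-up three-row data frame (the rows are hypothetical). `fill()` fills downwards by default, so the undated row simply inherits the date of the row above it.

```r
library(tidyr)

# Hypothetical rows: the DMConversationEntry has no date of its own
toy <- data.frame(
  user = c("alice", "DMConversationEntry", "bob"),
  date = c("2015-05-02 14:12:27", NA, "2015-05-02 14:13:00"),
  stringsAsFactors = FALSE
)

# Default direction is "down": each NA takes the last non-missing value above it
filled <- fill(toy, date)
filled$date
```

The inherited date is a lower bound at best: the entry happened at some point between the preceding and the following dated message.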
write_csv(res, "wikileaks_dm.csv")
range(res$date)
## [1] "2015-05-01 13:52:11" "2017-11-10 04:30:46"
walk(2015:2017,
~ filter(res, lubridate::year(date) == .x) %>%
write_csv(glue::glue("{.x}.csv"))
)
walk(unique(res$user),
~ filter(res, user == .x) %>%
write_csv(glue::glue("user_{make.names(.x)}.csv"))
)
res %>%
count(user, sort = TRUE) %>%
write_csv("user_count.csv")
res %>%
mutate(date = lubridate::ymd_hms(date),
date = lubridate::date(date)) %>%
count(date) %>%
write_csv("daily.csv")
mentions <- res %>%
mutate(mention = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>%
unnest(mention) %>%
select(mention, everything())
write_csv(mentions, "mentions.csv")
mentions %>%
count(mention, sort = TRUE) %>%
write_csv("mentions_count.csv")
urls <- res %>%
mutate(url = str_extract_all(text, "http.+")) %>%
unnest(url) %>%
select(url, everything())
write_csv(urls, "urls.csv")
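Note that the pattern "http.+" is greedy: it captures everything from the first "http" to the end of the message, including any text or further URL that follows. The sketch below compares it with a stricter pattern on a made-up message. The alternative pattern "https?://\\S+" is an assumption of this note, not what was used to build urls.csv.

```r
library(stringr)

# Hypothetical message containing two URLs and some surrounding text
msg <- "see https://example.org/a and http://example.org/b for details"

# Pattern used above: a single match running to the end of the message
greedy <- str_extract_all(msg, "http.+")[[1]]

# Stricter alternative (assumption): stop each match at the first whitespace
strict <- str_extract_all(msg, "https?://\\S+")[[1]]

greedy
strict
```

The stricter pattern may still keep trailing punctuation such as a closing parenthesis or a full stop, so its output would need checking before replacing the published file.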
list.files(pattern = "\\.csv$") %>%
walk(function(x) {
o <- read_csv(x)
jsonlite::write_json(
o,
path = glue::glue("{tools::file_path_sans_ext(x)}.json")
)
})
dir.create("json")
list.files(pattern = "\\.json$") %>%
walk(function(x){
file.copy(x, glue::glue("json/{x}"))
unlink(x)
})