Collecting tweets with R and {rtweet}
Some arguments in favor of {rtweet}, an amazing package by Michael W. Kearney.
Note: this post has been updated after recent changes in {rtweet} and {tidytext}. Check this post for more info.
From what I remember, I have been scraping web data for as long as I’ve been using R. This, of course, includes Google Analytics, Facebook, raw web pages, and so on and so forth. But most of all, I’ve always loved playing with Twitter data. I’ve been on Twitter for 8 years now, and I know what an insightful place it can be. It’s also full of data, so, you know, why not go down that rabbit hole!
A long time ago, in a faraway land
My first encounter with tweet collecting in R was through the {twitteR} package, which, despite its (sometimes) weird interface and strange functions, proved really useful in many past projects.
So, before {rtweet}, the common practice was to mine Twitter with {twitteR}, which was designed as an object-oriented package (so not tidy): calls to the API create environments, in which you can use methods for further investigation. Also, results did not come in a data.frame format (even if there was a function to turn lists of tweets into a df).
Searching for tweets was quite straightforward: set up OAuth, search Twitter, convert the result to a data.frame, deal with dates… But a drawback appeared when doing text mining: you had to look for hashtags by subsetting the raw text of the tweets, which could be kind of a headache. Not to mention when you needed to retrieve followers, timelines, and everything else that makes Twitter such a rich place to mine for data.
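For the record, here is roughly what that old workflow looked like (a from-memory sketch; the keys are placeholders, and you needed a registered Twitter app to get them):

library(twitteR)

# authenticate with the credentials of a registered Twitter app
setup_twitter_oauth(consumer_key    = "XXXX",
                    consumer_secret = "XXXX",
                    access_token    = "XXXX",
                    access_secret   = "XXXX")

# searchTwitter() returns a list of status objects...
raw <- searchTwitter("#Rennes", n = 100)

# ...which you then had to flatten yourself:
df <- twListToDF(raw)

# and hashtags had to be regex-ed out of the raw text:
regmatches(df$text, gregexpr("#\\w+", df$text))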
Why do I use {rtweet}?
I’ve been using {rtweet} for several projects now, the most recent being:
which I wrote just after the first Breizh Data Day.
So, {rtweet} is a (relatively) young package, as it was first published on CRAN around a year ago (August 2016). It provides a tidy approach to collecting Twitter data: its main functions return a data.frame following the tidy data principles.
Another cool thing is that you don’t need to set up a Twitter app in order to search for tweets.
{rtweet} workflow
Note: please be sure to have the latest versions of {rtweet} and {tidytext} (respectively 0.6.0 and 0.1.6):
packageVersion("rtweet")
[1] ‘0.6.0’
packageVersion("tidytext")
[1] ‘0.1.6’
So, let’s try searching for 1000 tweets with the hashtag #Rennes using {rtweet}.
library(rtweet)
srch <- search_tweets("#Rennes", n = 1000)
Searching for tweets...
Finished collecting tweets!
dplyr::glimpse(srch)
Observations: 996
Variables: 42
$ status_id <chr> "953255590951899136", "953255475461677056", "953254988632993792", "9532549822157824...
$ created_at <dttm> 2018-01-16 13:20:18, 2018-01-16 13:19:50, 2018-01-16 13:17:54, 2018-01-16 13:17:53...
$ user_id <chr> "2737710932", "493396070", "276708676", "824240805158207489", "556902767", "3435794...
$ screen_name <chr> "AtonAha", "idelestre", "unidiversmag", "Rennes1jour", "M2bliquis", "OMAddictFR", "...
$ text <chr> "RT @Gendarmerie: #Bretagne Un trafic de #Stups démantelé par la SR de #Rennes et l...
$ source <chr> "Twitter Web Client", "Twitter Web Client", "social unidivers", "Rennes", "Twitter ...
$ reply_to_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ reply_to_user_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ reply_to_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ is_quote <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ is_retweet <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE,...
$ favorite_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0, 0,...
$ retweet_count <int> 86, 0, 0, 0, 3, 0, 2, 4, 0, 2, 2, 46, 5, 46, 0, 46, 46, 46, 46, 1, 1, 7, 10, 0, 46,...
$ hashtags <list> [<"Bretagne", "Stups", "Rennes", "Brest">, <"startup", "Rennes">, <"QUAND", "Renne...
$ symbols <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ urls_url <list> [NA, "l.leparisien.fr/s/SS6x", "unidivers.fr/quand-dieu-app…", "unidivers.fr/quand...
$ urls_t.co <list> [NA, "https://t.co/nEGvrKH3Jb", "https://t.co/QBVUlmd4DW", "https://t.co/rGPIycM8X...
$ urls_expanded_url <list> [NA, "http://l.leparisien.fr/s/SS6x", "https://www.unidivers.fr/quand-dieu-apprena...
$ media_url <list> [NA, NA, "http://pbs.twimg.com/media/DTqkW5sXcAEltYO.jpg", "http://pbs.twimg.com/m...
$ media_t.co <list> [NA, NA, "https://t.co/5bc1zrBGhh", "https://t.co/ztCCVEnvHO", NA, NA, NA, NA, NA,...
$ media_expanded_url <list> [NA, NA, "https://twitter.com/unidiversmag/status/953254988632993792/photo/1", "ht...
$ media_type <list> [NA, NA, "photo", "photo", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ ext_media_url <list> [NA, NA, "http://pbs.twimg.com/media/DTqkW5sXcAEltYO.jpg", "http://pbs.twimg.com/m...
$ ext_media_t.co <list> [NA, NA, "https://t.co/5bc1zrBGhh", "https://t.co/ztCCVEnvHO", NA, NA, NA, NA, NA,...
$ ext_media_expanded_url <list> [NA, NA, "https://twitter.com/unidiversmag/status/953254988632993792/photo/1", "ht...
$ ext_media_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ mentions_user_id <list> ["2184489764", "38142665", NA, NA, "863015893433028608", NA, "2273783713", "251269...
$ mentions_screen_name <list> ["Gendarmerie", "le_Parisien", NA, NA, "3508CentreNord", NA, "Skillo1989", "ren_mu...
$ lang <chr> "fr", "fr", "fr", "fr", "fr", "it", "fr", "fr", "fr", "fr", "fr", "fr", "fr", "fr",...
$ quoted_status_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ quoted_text <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ retweet_status_id <chr> "952976442488389636", NA, NA, NA, "953248780085874688", NA, "953253824545873920", "...
$ retweet_text <chr> "#Bretagne Un trafic de #Stups démantelé par la SR de #Rennes et la BR de #Brest : ...
$ place_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_full_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ country <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ country_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ geo_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <...
$ coords_coords <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <...
$ bbox_coords <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA, ...
And here it is. Yes, it’s that simple: 1000 tweets in two commands (one of them being library()), without specifying any OAuth token.
So now we have a tidy data.frame with the results: hashtags sit in a list column we can easily parse and count, as do text, mentions, and everything needed to start mining data.
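For instance, here is a minimal sketch of counting hashtags straight from that list column (an illustration, not part of the original analysis; the column name comes from the glimpse above):

library(dplyr)

tibble(hashtag = unlist(srch$hashtags)) %>% # flatten the list column
  filter(!is.na(hashtag)) %>%               # drop tweets without hashtags
  count(hashtag, sort = TRUE)               # most frequent hashtags first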
Quick example
Note: the recent version of {rtweet} returns hashtags in a list column, which is not handled by unnest_tokens(), hence the as_vector() call below.
We can write a quick hashtag count and sentiment analysis:
library(dplyr)
library(tidytext)
library(proustr)
library(purrr) # for as_vector()

as_vector(srch$hashtags) %>% # flatten the list column into a character vector
  table() %>%                # count each hashtag
  as.data.frame() %>%
  arrange(Freq) %>%
  top_n(5)                   # keep the 5 most frequent
Selecting by Freq
. Freq
1 Stups 76
2 Brest 82
3 Bretagne 88
4 rennes 116
5 Rennes 889
srch %>%
  unnest_tokens(word, text) %>%                    # one row per word in each tweet
  select(word) %>%
  left_join(proust_sentiments(type = "score")) %>% # join a French sentiment lexicon from {proustr}
  na.omit() %>%                                    # drop words absent from the lexicon
  count(sentiment)
Joining, by = "word"
# A tibble: 6 x 2
sentiment n
<chr> <int>
1 anger 186
2 disgust 135
3 fear 279
4 joy 125
5 sadness 244
6 surprise 189
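And if you want a quick visual of these scores, the same count pipes straight into {ggplot2} (a minimal sketch, not part of the original analysis):

library(ggplot2)

srch %>%
  unnest_tokens(word, text) %>%
  select(word) %>%
  left_join(proust_sentiments(type = "score")) %>%
  na.omit() %>%
  count(sentiment) %>%
  ggplot(aes(x = reorder(sentiment, n), y = n)) + # order the bars by count
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "number of words")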
Of course, {rtweet} is full of other features you can try.
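To give an idea, here is a quick sketch of a few other endpoints (not run here; the screen name is a placeholder):

get_timeline("some_user", n = 100)    # a user's most recent tweets
get_followers("some_user", n = 100)   # the ids of the accounts following a user
lookup_users("some_user")             # profile information
stream_tweets("rennes", timeout = 30) # live tweets matching "rennes", for 30 seconds

But this blog post was more of an introduction, and I hope I’ve convinced you to download and use this amazing package!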
What do you think?