Collecting tweets with R and {rtweet}

5 minute read

Some arguments in favor of {rtweet}, an amazing package by Michael W. Kearney.

From what I remember, I have been scraping web data for as long as I've been using R. This includes Google Analytics, Facebook, raw web pages, and so on and so forth, but most of all, I've always loved playing with Twitter data. I've been on Twitter for 8 years now, and I know what an insightful place it can be. It's also full of data, so you know, why not go down that rabbit hole!

A long time ago, in a faraway land

My first encounter with tweet collection in R was through the {twitteR} package, which, despite its (sometimes) weird interface and strange functions, has been really useful in many past projects.

So, before {rtweet}, the common practice was to mine Twitter with {twitteR}, which was designed as an object-oriented package (so not tidy): calls to the API create environments, on which you then use methods for further investigation. Also, the results did not come in a data.frame format (even though there is a function to turn these Twitter lists into a df).

Searching for tweets was quite straightforward: set up OAuth, search Twitter, convert the results to a data.frame, deal with dates… But a drawback showed up when doing text mining: you had to look for hashtags by subsetting the raw text of the tweets, which could be kind of a headache. Not to mention when you needed to retrieve followers, timelines, and everything else that makes Twitter such a rich place to mine for data.
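
To give you an idea, here is roughly what that looked like once the API results had been turned into a data.frame (with twitteR::twListToDF()): hashtags had to be fished out of the raw text with a regex. The little data.frame below is just made up for the example.

library(stringr)

# A made-up data.frame standing in for the output of twitteR::twListToDF()
tweets <- data.frame(
  text = c("Rainy day in #Rennes #Bretagne",
           "New job offer in #Rennes #emploi"),
  stringsAsFactors = FALSE
)

# Extract hashtags from the raw text by hand
unlist(str_extract_all(tweets$text, "#\\w+"))
[1] "#Rennes"   "#Bretagne" "#Rennes"   "#emploi"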

Why do I use {rtweet}?

I've been using {rtweet} for several projects now, the most recent being a post I wrote just after the first Breizh Data Day.

So, {rtweet} is a (relatively) young package: it was first published on CRAN around a year ago (August 2016). It provides a tidy approach to collecting Twitter data: the main functions return a data.frame following the tidy data principles.

Another cool thing is that you don't need to set up a Twitter app in order to search for tweets.

{rtweet} workflow

So, let's try searching for 1,000 tweets containing the hashtag #Rennes with {rtweet}.

library(rtweet)
srch <- search_tweets("#Rennes", n = 1000)
Searching for tweets...
Finished collecting tweets!
  
dplyr::glimpse(srch)
Observations: 1,000
Variables: 35
$ screen_name                    <chr> "LaurentBerthet1", "LoudL", "AnneRH_", "SAddictfr", "Girondins...
$ user_id                        <chr> "1718082409", "127659284", "878929995602821121", "869930134391...
$ created_at                     <dttm> 2017-11-01 09:11:41, 2017-11-01 09:11:23, 2017-11-01 09:10:26...
$ status_id                      <chr> "925651548222492672", "925651473035350017", "92565123359099289...
$ text                           <chr> "RT @GaspardGlanz: Grand merci à Maitre @VincentFillola pour s...
$ retweet_count                  <int> 2, 2, 0, 0, 0, 5, 0, 4, 10, 0, 0, 2, 3, 0, 1, 0, 0, 0, 0, 4, 4...
$ favorite_count                 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 5, 0, 1, 0, 1, 0, 0, 0, 0, 0,...
$ is_quote_status                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ quote_status_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ is_retweet                     <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALS...
$ retweet_status_id              <chr> "925649093329915905", "925649093329915905", NA, NA, NA, "92534...
$ in_reply_to_status_status_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "925646918...
$ in_reply_to_status_user_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "573995912...
$ in_reply_to_status_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "LeclercFl...
$ lang                           <chr> "fr", "fr", "fr", "fr", "fr", "fr", "fr", "fr", "fr", "pt", "f...
$ source                         <chr> "Twitter for iPhone", "Twitter Lite", "IFTTT", "SRFC Addict 2"...
$ media_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "925649225068765184", "925...
$ media_url                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "http://pbs.twimg.com/medi...
$ media_url_expanded             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://twitter.com/Laure...
$ urls                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ urls_display                   <chr> NA, NA, "twitter.com/i/web/status/9…", "football-addict.com/ar...
$ urls_expanded                  <chr> NA, NA, "https://twitter.com/i/web/status/925282923615543297",...
$ mentions_screen_name           <chr> "GaspardGlanz VincentFillola", "GaspardGlanz VincentFillola", ...
$ mentions_user_id               <chr> "53798973 514620241", "53798973 514620241", NA, NA, NA, "29779...
$ symbols                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ hashtags                       <chr> "Rennes", "Rennes", "RH Digital parking Gare Rennes EuroRennes...
$ coordinates                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_type                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_name                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_full_name                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ country_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ country                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ bounding_box_coordinates       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ bounding_box_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...


And here it is. Yes, it's that simple: 1,000 tweets in two commands (one of them being library()), without specifying any OAuth token.

So now we have a tidy data.frame with the results: the hashtags are in a column we can easily parse and count, and so are the text, the mentions, and everything else needed to start mining the data.
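
For instance, here is a quick sketch using a couple of those columns directly: dropping the retweets and counting the clients people tweet from.

library(dplyr)

# Keep only original tweets and count the Twitter clients they were sent from
srch %>%
  filter(!is_retweet) %>%
  count(source, sort = TRUE)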

Quick example

We can write a quick hashtag and sentiment analysis:

library(dplyr)
library(tidytext)
library(proustr)

unnest_tokens(srch, hashtags, hashtags) %>%
  filter(hashtags != "rennes") %>%
  count(hashtags) %>%
  top_n(5)
Selecting by n
# A tibble: 5 x 2
hashtags     n
<chr> <int>
1    bretagne    53
2         cdi    50
3      emploi    36
4     quimper    44
5 visitrennes    48
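
Note that top_n() keeps the five most frequent hashtags but does not reorder them; if you want them sorted, just add an arrange():

unnest_tokens(srch, hashtags, hashtags) %>%
  filter(hashtags != "rennes") %>%
  count(hashtags) %>%
  top_n(5) %>%
  arrange(desc(n))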

unnest_tokens(srch, word, text) %>%
  select(word) %>%
  left_join(proust_sentiments(type = "score")) %>%
  na.omit() %>%
  count(sentiment)
Joining, by = "word"
# A tibble: 6 x 2
sentiment     n
<chr> <int>
1     anger   204
2   disgust   147
3      fear   211
4       joy    89
5   sadness   145
6  surprise   208
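
And if you would rather look at these counts on a plot, a quick {ggplot2} sketch could look like this:

library(ggplot2)

# Store the sentiment counts, then plot them as a bar chart
sentiments <- unnest_tokens(srch, word, text) %>%
  select(word) %>%
  left_join(proust_sentiments(type = "score")) %>%
  na.omit() %>%
  count(sentiment)

ggplot(sentiments, aes(x = sentiment, y = n)) +
  geom_col() +
  labs(x = "Sentiment", y = "Number of words")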

Of course, {rtweet} is full of other features you can try (see the quick sketch below). But this blog post was meant as an introduction, and I hope I've convinced you to download and use this amazing package!
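
Here is that sketch, showing a few of those other functions (the screen name and the keyword are just placeholders to adapt to your own needs):

library(rtweet)

# Last 200 tweets posted by a given account
tl <- get_timeline("rstudio", n = 200)

# User IDs of 500 of that account's followers
fls <- get_followers("rstudio", n = 500)

# Stream live tweets mentioning a keyword for 30 seconds
live <- stream_tweets("rstats", timeout = 30)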
