Collecting tweets with R and {rtweet}

5 minute read

Some arguments in favor of {rtweet}, an amazing package by Michael W. Kearney.

From what I remember, I have been scraping web data for as long as I've been using R. This includes Google Analytics, Facebook, raw web pages, and so on and so forth, but most of all, I've always loved playing with Twitter data. I've been on Twitter for 8 years now, and I know what an insightful place it can be. It's also full of data, so you know, why not go down that rabbit hole!

A long time ago, in a faraway land

My first encounter with tweet collection in R was through the {twitteR} package, which, despite its (sometimes) weird interface and strange functions, has been really useful in many past projects.

So, before {rtweet}, the common practice was to mine Twitter with {twitteR}, which was designed as an object-oriented package (so not tidy): calls to the API create environments, on which you then use methods for further investigation. Also, the results did not come in a data.frame format (even though there is a function to turn these Twitter lists into a df).

Searching for tweets was quite straightforward: set up OAuth, search Twitter, convert the results to a data.frame, deal with dates… But a drawback showed up when doing text mining: you had to look for hashtags by subsetting the raw text of the tweets, which could be kind of a headache. Not to mention when you needed to retrieve followers, timelines, and everything else that makes Twitter such a rich place to mine for data.
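
To give you an idea, here is roughly what that looked like once the API results had been turned into a data.frame (with twitteR::twListToDF()): hashtags had to be fished out of the raw text with a regex. The little data.frame below is just made up for the example.

library(stringr)

# A made-up data.frame standing in for the output of twitteR::twListToDF()
tweets <- data.frame(
  text = c("Rainy day in #Rennes #Bretagne",
           "New job offer in #Rennes #emploi"),
  stringsAsFactors = FALSE
)

# Extract hashtags from the raw text by hand
unlist(str_extract_all(tweets$text, "#\\w+"))
[1] "#Rennes"   "#Bretagne" "#Rennes"   "#emploi"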

Why do I use {rtweet}?

I've been using {rtweet} for several projects now, the most recent being a post I wrote just after the first Breizh Data Day.

So, {rtweet} is a (relatively) young package: it was first published on CRAN around a year ago (August 2016). It provides a tidy approach to collecting Twitter data: the main functions return a data.frame following the tidy data principles.

Another cool thing is that you don't need to set up a Twitter app in order to search for tweets.

{rtweet} workflow

So, let's try searching for 1,000 tweets containing the hashtag #Rennes with {rtweet}.

library(rtweet)
srch <- search_tweets("#Rennes", n = 1000)
Searching for tweets...
Finished collecting tweets!
  
dplyr::glimpse(srch)
Observations: 1,000
Variables: 35
$ screen_name                    <chr> "LaurentBerthet1", "LoudL", "AnneRH_", "SAddictfr", "Girondins...
$ user_id                        <chr> "1718082409", "127659284", "878929995602821121", "869930134391...
$ created_at                     <dttm> 2017-11-01 09:11:41, 2017-11-01 09:11:23, 2017-11-01 09:10:26...
$ status_id                      <chr> "925651548222492672", "925651473035350017", "92565123359099289...
$ text                           <chr> "RT @GaspardGlanz: Grand merci à Maitre @VincentFillola pour s...
$ retweet_count                  <int> 2, 2, 0, 0, 0, 5, 0, 4, 10, 0, 0, 2, 3, 0, 1, 0, 0, 0, 0, 4, 4...
$ favorite_count                 <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 5, 0, 1, 0, 1, 0, 0, 0, 0, 0,...
$ is_quote_status                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ quote_status_id                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ is_retweet                     <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALS...
$ retweet_status_id              <chr> "925649093329915905", "925649093329915905", NA, NA, NA, "92534...
$ in_reply_to_status_status_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "925646918...
$ in_reply_to_status_user_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "573995912...
$ in_reply_to_status_screen_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "LeclercFl...
$ lang                           <chr> "fr", "fr", "fr", "fr", "fr", "fr", "fr", "fr", "fr", "pt", "f...
$ source                         <chr> "Twitter for iPhone", "Twitter Lite", "IFTTT", "SRFC Addict 2"...
$ media_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "925649225068765184", "925...
$ media_url                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "http://pbs.twimg.com/medi...
$ media_url_expanded             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "https://twitter.com/Laure...
$ urls                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ urls_display                   <chr> NA, NA, "twitter.com/i/web/status/9…", "football-addict.com/ar...
$ urls_expanded                  <chr> NA, NA, "https://twitter.com/i/web/status/925282923615543297",...
$ mentions_screen_name           <chr> "GaspardGlanz VincentFillola", "GaspardGlanz VincentFillola", ...
$ mentions_user_id               <chr> "53798973 514620241", "53798973 514620241", NA, NA, NA, "29779...
$ symbols                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ hashtags                       <chr> "Rennes", "Rennes", "RH Digital parking Gare Rennes EuroRennes...
$ coordinates                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_type                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_name                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ place_full_name                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ country_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ country                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ bounding_box_coordinates       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
$ bounding_box_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...


And here it is. Yes, it's that simple: 1,000 tweets in two commands (one of them being library()), without specifying any OAuth token.

So now we have a tidy data.frame with the results: the hashtags are in a column we can easily parse and count, and so are the text, the mentions, and everything else needed to start mining the data.
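
For instance, here is a quick sketch using a couple of those columns directly: dropping the retweets and counting the clients people tweet from.

library(dplyr)

# Keep only original tweets and count the Twitter clients they were sent from
srch %>%
  filter(!is_retweet) %>%
  count(source, sort = TRUE)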

Quick example

We can write a quick hashtag and sentiment analysis:

library(dplyr)
library(tidytext)
library(proustr)

unnest_tokens(srch, hashtags, hashtags) %>%
  filter(hashtags != "rennes") %>%
  count(hashtags) %>%
  top_n(5)
Selecting by n
# A tibble: 5 x 2
hashtags     n
<chr> <int>
1    bretagne    53
2         cdi    50
3      emploi    36
4     quimper    44
5 visitrennes    48
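
Note that top_n() keeps the five most frequent hashtags but does not reorder them; if you want them sorted, just add an arrange():

unnest_tokens(srch, hashtags, hashtags) %>%
  filter(hashtags != "rennes") %>%
  count(hashtags) %>%
  top_n(5) %>%
  arrange(desc(n))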

unnest_tokens(srch, word, text) %>%
  select(word) %>%
  left_join(proust_sentiments(type = "score")) %>%
  na.omit() %>%
  count(sentiment)
Joining, by = "word"
# A tibble: 6 x 2
sentiment     n
<chr> <int>
1     anger   204
2   disgust   147
3      fear   211
4       joy    89
5   sadness   145
6  surprise   208
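
And if you would rather look at these counts on a plot, a quick {ggplot2} sketch could look like this:

library(ggplot2)

# Store the sentiment counts, then plot them as a bar chart
sentiments <- unnest_tokens(srch, word, text) %>%
  select(word) %>%
  left_join(proust_sentiments(type = "score")) %>%
  na.omit() %>%
  count(sentiment)

ggplot(sentiments, aes(x = sentiment, y = n)) +
  geom_col() +
  labs(x = "Sentiment", y = "Number of words")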

Of course, {rtweet} is full of other features you can try (see the quick sketch below). But this blog post was meant as an introduction, and I hope I've convinced you to download and use this amazing package!
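
Here is that sketch, showing a few of those other functions (the screen name and the keyword are just placeholders to adapt to your own needs):

library(rtweet)

# Last 200 tweets posted by a given account
tl <- get_timeline("rstudio", n = 200)

# User IDs of 500 of that account's followers
fls <- get_followers("rstudio", n = 500)

# Stream live tweets mentioning a keyword for 30 seconds
live <- stream_tweets("rstats", timeout = 30)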
