Combining the new {rtweet} and {tidytext}


About keeping your packages up to date.

I recently received some emails and comments asking why the code in some of my earlier blog posts and slides couldn’t be reproduced.

Spoiler: this is due to the behavior of {tidytext}, which didn’t accept the output of the new {rtweet}. The problem is almost solved (well, it has shifted, as you’ll see below). I thought answering these comments and emails would also be a good occasion to talk a little about how to go back to a previous version of a package (and to provide a small workaround for the error currently thrown when trying to mine the hashtags column of the {rtweet} output).

New {rtweet} vs previous {tidytext}

The update of {rtweet} was published on CRAN on 2017-11-16 (so after I published the blog posts / slides I just mentioned). It comes with a new feature: list columns.

Problem is: you can’t pass a data frame containing list-columns to unnest_tokens in version 0.1.5 of {tidytext}. This was prevented by the third to fifth lines of tidytext:::unnest_tokens.data.frame:

if (any(!purrr::map_lgl(tbl, is.atomic))) {
  stop("unnest_tokens expects all columns of input to be atomic vectors (not lists)")
}
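To see what this check reacts to, here is a minimal sketch on a toy data frame (the `toy` object and its values are made up for illustration). It uses base R's vapply() instead of purrr::map_lgl(), so it runs without any extra package, but the test is the same idea: is every column an atomic vector?

```r
# Toy data frame (hypothetical values): one atomic column, one list-column
toy <- data.frame(screen_name = c("a", "b"), stringsAsFactors = FALSE)
toy$hashtags <- list(c("r", "rstats"), "rstats")

# One logical per column: TRUE if the column is an atomic vector
is_atomic_col <- vapply(toy, is.atomic, logical(1))
is_atomic_col

# This is the condition that triggered stop() in tidytext 0.1.5
any(!is_atomic_col)
```

The list-column is the only one flagged as non-atomic, and that alone is enough to make the condition TRUE.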

Getting back to a previous version of a package

To illustrate this {rtweet} and {tidytext} issue, let’s go back in time to previous versions of these packages. For this, we’ll use the install_version() function from {devtools}:

library(devtools)
install_version("rtweet", version = "0.4.0", repos = "http://cran.us.r-project.org", quiet = TRUE)
Downloading package from url: http://cran.us.r-project.org/src/contrib/Archive/rtweet/rtweet_0.4.0.tar.gz
Installing rtweet
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL '/private/var/folders/lz/thnnmbpd1rz0h1tmyzgg0mh00000gn/T/RtmpPiXNRI/devtools766408992de/rtweet'  \
  --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests 

* installing *source* package rtweet ...
** package 'rtweet' successfully unpacked and MD5 sums checked
** R
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (rtweet)

install_version("tidytext", version = "0.1.5", repos = "http://cran.us.r-project.org", quiet = TRUE)
Downloading package from url: http://cran.us.r-project.org/src/contrib/Archive/tidytext/tidytext_0.1.5.tar.gz
Installing tidytext
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL '/private/var/folders/lz/thnnmbpd1rz0h1tmyzgg0mh00000gn/T/RtmpPiXNRI/devtools7664a041fa4/tidytext'  \
  --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests 

* installing *source* package tidytext ...
** package 'tidytext' successfully unpacked and MD5 sums checked
** R
** data
*** moving datasets to lazyload DB
** inst
** tests
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (tidytext)
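If you run this kind of script several times, you may not want to reinstall on every run. A possible sketch, assuming a small helper of my own called has_version() (it is not part of {devtools}), is to compare the installed version with the one you want before calling install_version():

```r
# Hypothetical helper: TRUE only if `pkg` is installed at exactly `wanted`.
# system.file() is used so that the package namespace is not loaded.
has_version <- function(pkg, wanted) {
  installed <- nzchar(system.file(package = pkg))
  installed && utils::packageVersion(pkg) == package_version(wanted)
}

if (!has_version("rtweet", "0.4.0")) {
  # This is where devtools::install_version("rtweet", version = "0.4.0", ...)
  # would go; it is left out here so the sketch runs offline.
  message("rtweet 0.4.0 is not installed")
}
```

The comparison relies on package_version(), so "0.4.0" and "0.4" are compared as version numbers rather than as strings.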

packageVersion("rtweet")
[1] 0.4.0
packageVersion("tidytext")
[1] 0.1.5
library(dplyr)
rtweets04 <- rtweet::search_tweets("#RStats", n = 10)
Searching for tweets...
Finished collecting tweets!
 glimpse(rtweets04)
Observations: 77
Variables: 35
$ screen_name                    <chr> "brian_leavell", "Pranesh___K", "OpenCageData", "haseebmahmud", "XihongLin"...
$ user_id                        <chr> "770639488645300225", "1445945130", "1404653376", "15199563", "893499404728...
$ created_at                     <dttm> 2018-01-16 12:51:36, 2018-01-16 12:51:16, 2018-01-16 12:50:56, 2018-01-16 ...
$ status_id                      <chr> "953248371606736896", "953248285497573376", "953248203708694528", "95324813...
$ text                           <chr> "RT @kateumbers: What to do if you have a compulsive data collating problem...
$ retweet_count                  <int> 1, 2, 2, 43, 6, 12, 12, 12, 43, 43, 6, 43, 43, 2, 43, 43, 6, 43, 0, 43, 43,...
$ favorite_count                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ is_quote_status                <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ quote_status_id                <chr> "953004640118874112", NA, "953229496819306496", NA, NA, NA, NA, NA, NA, NA,...
$ is_retweet                     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
$ retweet_status_id              <chr> "953179356574113793", "952989197148749824", "953234288530608129", "95310176...
$ in_reply_to_status_status_id   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ in_reply_to_status_user_id     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ in_reply_to_status_screen_name <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ lang                           <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en...
$ source                         <chr> "Twitter for Android", "Twitter Web Client", "Twitter Web Client", "Twitter...
$ media_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ media_url                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ media_url_expanded             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ urls_display                   <chr> NA, NA, NA, "buff.ly/2DyyumE", "rich-iannone.github.io/DiagrammeR/", "wp.me...
$ urls_expanded                  <chr> NA, NA, NA, "https://buff.ly/2DyyumE", "http://rich-iannone.github.io/Diagr...
$ mentions_screen_name           <chr> "kateumbers", "jasonbaik94", "ma_salmon dpprdan rOpenSci OpenCageData", "da...
$ mentions_user_id               <chr> "322411475", "1888111382", "2865404679 828915258211303424 342250615 1404653...
$ symbols                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ hashtags                       <chr> "r rstats meta ecology evolution data metaanalysis", "rstats", "rstats", "r...
$ coordinates                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_id                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_type                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_name                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ place_full_name                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ country_code                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ country                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ bounding_box_coordinates       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ bounding_box_type              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

With these versions, you could simply do this (as you can find in the slides):

rtweets04 %>% 
      tidytext::unnest_tokens(word, hashtags) %>% 
      select(screen_name, word) %>%
      slice(1:5)
# A tibble: 5 x 2
    screen_name      word
          <chr>     <chr>
1 brian_leavell         r
2 brian_leavell    rstats
3 brian_leavell      meta
4 brian_leavell   ecology
5 brian_leavell evolution
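Why did this work? In {rtweet} 0.4.0 the hashtags column is a plain character column, with the hashtags of a tweet separated by spaces, so it only has to be split into one row per word. A rough base R sketch of that idea, on a toy data frame with made-up values (this is not how {tidytext} implements it, just an illustration of the reshaping):

```r
# Toy data: one space-separated hashtag string per tweet
toy <- data.frame(
  screen_name = c("brian_leavell", "Pranesh___K"),
  hashtags = c("r rstats meta", "rstats"),
  stringsAsFactors = FALSE
)

# Split each string on spaces, then repeat the screen name once per word
words <- strsplit(toy$hashtags, " ", fixed = TRUE)
long <- data.frame(
  screen_name = rep(toy$screen_name, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)
long
```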

{rtweet} 0.6

For a while, there was an issue when using the two packages together: {rtweet} 0.6 was released on 2017-11-16, but {tidytext} 0.1.6 only on 2018-01-07. In between, with {rtweet} 0.6 and {tidytext} 0.1.5 installed, passing the {rtweet} result to unnest_tokens() gave you an error.

Let’s simulate this behavior by updating to {rtweet} 0.6, while staying at {tidytext} 0.1.5.

detach("package:rtweet")
install.packages("rtweet", repos = "http://cran.us.r-project.org", quiet = TRUE)
packageVersion("rtweet")
[1] 0.6.0
packageVersion("tidytext")
[1] 0.1.5

So if I try to do the exact same thing:

library(rtweet)
rtweets06 <- search_tweets("#RStats", n = 10)
Searching for tweets...
Finished collecting tweets!
 glimpse(rtweets06)
Observations: 9
Variables: 42
$ status_id              <chr> "953248371606736896", "953248285497573376", "953248203708694528", "9532481352031723...
$ created_at             <dttm> 2018-01-16 12:51:36, 2018-01-16 12:51:16, 2018-01-16 12:50:56, 2018-01-16 12:50:40...
$ user_id                <chr> "770639488645300225", "1445945130", "1404653376", "15199563", "893499404728053760",...
$ screen_name            <chr> "brian_leavell", "Pranesh___K", "OpenCageData", "haseebmahmud", "XihongLin", "sello...
$ text                   <chr> "RT @kateumbers: What to do if you have a compulsive data collating problem. #r #rs...
$ source                 <chr> "Twitter for Android", "Twitter Web Client", "Twitter Web Client", "Twitter for iPh...
$ reply_to_status_id     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ reply_to_user_id       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ reply_to_screen_name   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ is_quote               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
$ is_retweet             <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
$ favorite_count         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0
$ retweet_count          <int> 1, 2, 2, 43, 6, 12, 12, 12, 43
$ hashtags               <list> [<"r", "rstats", "meta", "ecology", "evolution", "data", "metaanalysis">, "rstats"...
$ symbols                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ urls_url               <list> [NA, NA, NA, "buff.ly/2DyyumE", "rich-iannone.github.io/DiagrammeR/", "wp.me/pMm6L...
$ urls_t.co              <list> [NA, NA, NA, "https://t.co/gpTEhFhfY5", "https://t.co/TZdttVMrTf", "https://t.co/4...
$ urls_expanded_url      <list> [NA, NA, NA, "https://buff.ly/2DyyumE", "http://rich-iannone.github.io/DiagrammeR/...
$ media_url              <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ media_t.co             <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ media_expanded_url     <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ media_type             <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ ext_media_url          <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ ext_media_t.co         <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ ext_media_expanded_url <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA]
$ ext_media_type         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ mentions_user_id       <list> ["322411475", "1888111382", <"2865404679", "828915258211303424", "342250615", "140...
$ mentions_screen_name   <list> ["kateumbers", "jasonbaik94", <"ma_salmon", "dpprdan", "rOpenSci", "OpenCageData">...
$ lang                   <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en"
$ quoted_status_id       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ quoted_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ retweet_status_id      <chr> "953179356574113793", "952989197148749824", "953234288530608129", "9531017606960619...
$ retweet_text           <chr> "What to do if you have a compulsive data collating problem. #r #rstats #meta #ecol...
$ place_url              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ place_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ place_full_name        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ place_type             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ country                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ country_code           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA
$ geo_coords             <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <...
$ coords_coords          <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <...
$ bbox_coords            <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA, ...

rtweets06 %>% 
    tidytext::unnest_tokens(word, text) %>% 
    select(screen_name, word) %>%
    slice(1:5)
Error in unnest_tokens.data.frame(., word, text) : 
  unnest_tokens expects all columns of input to be atomic vectors (not lists)

This doesn’t work because {rtweet} results now have list columns, and unnest_tokens() from {tidytext} 0.1.5 throws an error as soon as it finds one: the function checks that all the columns of the data frame are atomic.
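If you were stuck on {tidytext} 0.1.5, one possible way around this (a sketch of my own, not something either package documents) would be to drop the list columns before tokenizing, since the text column itself is atomic. On a toy data frame with made-up values:

```r
# Toy data frame: two atomic columns and one list-column
toy <- data.frame(
  screen_name = c("a", "b"),
  text = c("hello world", "foo"),
  stringsAsFactors = FALSE
)
toy$hashtags <- list(c("r", "rstats"), "rstats")

# Keep only the atomic columns; the result would pass the 0.1.5 check
atomic_only <- toy[vapply(toy, is.atomic, logical(1))]
names(atomic_only)
```

The price is that the list columns are lost, so this only helps when you tokenize a character column such as text.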

The new {tidytext} version removes this restriction: you can now pass a data frame containing list columns.

# I'm restarting R from RStudio here, to avoid the "internal error -3 in R_decompress1" error
.rs.restartR()
NULL

Restarting R session...

install.packages("tidytext", repos = "http://cran.us.r-project.org", quiet = TRUE)

  There is a binary version available but the source version is later:
         binary source needs_compilation
tidytext  0.1.5  0.1.6             FALSE

installing the source package tidytext

 
packageVersion("tidytext")
[1] 0.1.6

library(tidytext)
library(dplyr)
rtweets06 %>% 
      unnest_tokens(word, text) %>% 
      select(screen_name, word) %>%
      slice(1:5)
# A tibble: 5 x 2
    screen_name       word
          <chr>      <chr>
1 brian_leavell         rt
2 brian_leavell kateumbers
3 brian_leavell       what
4 brian_leavell         to
5 brian_leavell         do

This now works because {tidytext} no longer checks if all the columns are atomic.

Yet we’ve got an issue if we move to the hashtags column:

rtweets06 %>% 
      unnest_tokens(word, hashtags) %>% 
      select(screen_name, word) %>%
      slice(1:5)
Error in check_input(x) : 
  Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.

In fact, the problem has moved, because the function now checks at a lower level. You can indeed use a data frame containing list-columns, but you can’t pass a list-column as the input column.
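The rule stated in the error message can be paraphrased in a few lines of base R. The ok_input() function below is a hypothetical helper written for this post, not the actual check_input() code; it only mirrors the wording of the message:

```r
# Hypothetical paraphrase of the rule in the error message:
# a character vector of any length, or a list of length-1 character vectors
ok_input <- function(x) {
  is.character(x) ||
    (is.list(x) &&
       all(vapply(x, function(el) is.character(el) && length(el) == 1,
                  logical(1))))
}

ok_input(c("r", "rstats"))                   # character vector
ok_input(list("r", "rstats"))                # list of length-1 strings
ok_input(list(c("r", "rstats"), "rstats"))   # shape of the hashtags column
```

Only the last call returns FALSE: as soon as one tweet has more than one hashtag, the list-column no longer fits the accepted shape.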

Workaround for the hashtags column

So, as promised, here’s the workaround:

library(purrr)
as_vector(rtweets06$hashtags) %>% 
      table() %>% 
      as.data.frame() %>% 
      arrange(Freq) %>% 
      top_n(5)
Selecting by Freq
              . Freq
1          data    1
2       ecology    1
3     evolution    1
4          meta    1
5  metaanalysis    1
6             r    1
7       dataviz    2
8       ggplot2    2
9   DataScience    3
10       rstats    9
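The workaround above flattens the whole column, so you lose track of who used which hashtag. If you need to keep the screen name next to each hashtag, here is a base R sketch of one way to do it, on a toy data frame with made-up values and a hashtags list-column shaped like the {rtweet} 0.6 one:

```r
# Toy data: hashtags is a list-column, one character vector per tweet
toy <- data.frame(
  screen_name = c("brian_leavell", "Pranesh___K"),
  stringsAsFactors = FALSE
)
toy$hashtags <- list(c("r", "rstats", "meta"), "rstats")

# One row per (screen_name, hashtag) pair
long <- data.frame(
  screen_name = rep(toy$screen_name, lengths(toy$hashtags)),
  word = tolower(unlist(toy$hashtags)),
  stringsAsFactors = FALSE
)

# Hashtag frequencies, most frequent first
counts <- sort(table(long$word), decreasing = TRUE)
counts
```

Lowercasing with tolower() is a choice of mine here, so that "RStats" and "rstats" would be counted together; drop it if you want to keep the original case.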

What do you think?