Down the rabbit hole with tidy eval — Part 1

7 minute(s) read

Some random explanations about programming with tidy eval.

What on earth is evaluation?

So, let’s start with a simple question: what is evaluation? Evaluation is the process of analyzing an expression, in order to give the user something back. For example, in R, the standard evaluations is :

  • you type/send something to the console (called a symbol)
  • press enter
  • R does some magic stuffs
  • R returns you the value associated with the expression

For example :

# You type 1, the expression
1
# R evaluates 1, and returns you 
[1] 1

a <- 1
# Here, the expression is a (a is the symbol)
# Standard eval: when a symbol is evaluated, it return its value
a 
[1] 1

Pretty clear isn’t it?

Spoiler: the part about R doing magic stuffs wasn’t quite true. In fact, R takes the symbol you’ve entered (here a), turns it into and internal representation, then looks in the direct environment of the expression in order to return the value associated with it. If R doesn’t find the value in the environment the expression is linked to, it goes up to the parent env, then to the parent env, so on and so forth.

This is R standard evaluation. The returned object is the value the symbol is linked to. Keep this in mind, you’ll need this later.

Aside: about lazy evalution

An R strength is lazy evaluation. These strange words mean that R only evaluates the expression if the expression is actually used. That’s why this kind of function works:

lazy <- function(a, b){
  print("please take a nap")
}
lazy()
[1] "please take a nap"

lazy <- function(a, b){
  print(a)
}
lazy("please take a nap")
[1] "please take a nap"

Here in function 1, a and b are not evaluated in the environment of the function, so no error. In function 2, b is never called, so it’s not evaluated, and no error is thrown either. On the other hand, this doesn’t work:

lazy <- function(a, b){
  print(a)
  print(b)
}
lazy("please take a nap")
[1] "please take a nap"
Error in print(b) : 
  argument "b" is missing, with no default

Here, you can see that it throws an error: b is needed. You can also notice that a is first evaluated, the strings are printed, and only then the missing b throws an error.

About scoping

Quick thing to keep in mind here, the notion of environment. Each expression is by default evaluated in its environment. Then if it’s missing, R goes up to its parent env, then to the parent env, etc.

Each function defines its own environment, which can have its own rules (so basically its own rule for evaluation of a symbol). The env opened when the function is launched and closed when finished. That’s why you can’t directly access the object created inside a function :

create <- function(){
  a <- 1
}

create()
a
> Error: could not find 'a'

# Special character to override this  

create <- function(){
  a <<- 1
}
create()
a
[1] 1

# But please DON'T do that. 

Let’s focus: what about tidy eval?

So, back to our original point. I’ve been diving into tidy eval lately as I’ve been contributing to {narnia}, a package designed to analyse missing data, the tidy way. The whole philosophy of the package being the tidyverse, I needed to contribute with the same philosophy in mind.

So basically, I needed to create a function that took a df, the unquoted name x of a column, and dplyr::group_by with this column, and then ggplot::ggplot, with aes(x), the name of the column previously specified. Thing is, you can’t simply do :

# Note : this is obviously not the function I was working on. This is an example.
# 
# So you want to turn this into a function : 
library(tidyverse)
iris %>% 
  group_by(Species) %>% 
  slice(5:10) %>% 
  ggplot(aes(Species, Sepal.Length)) + 
  geom_point() 

# Let's try the simple way

gg_top <- function(df, col_group, col_plot){
  df %>%
    group_by(col_group) %>% 
    slice(5:10) %>% 
    ggplot(aes(col_group, col_plot)) + 
    geom_point()  
}

gg_top(df = iris, col_group = Species, col_plot = Sepal.Length)

Error in grouped_df_impl(data, unname(vars), drop) : 
  Column `col_group` is unknown

OK. Here R simply can’t find col_group. But where is this coming from? I did specified that col_group was equal to Species. Why is it looking for col?

Let’s try something else.

# This works 
select(iris, Species)

# So what if I want to reproduce it?
# I can think of 

select_custom <- function(df, col){
  df[, col]
}
select_custom(df = iris, col = Species)
> Error in `[.data.frame`(df, , col) : could not find 'Species' 

# But this works: 
select_custom(df = iris, col = "Species")

God damn, how is it that dplyr::select works with unquoted element, while select_custom needs a quoted string? That’s because :

  • select_custom uses the standard evaluation: R sees the symbol Species, and tries to evaluate the standard way — i.e. by looking in the environment of the function for the value of Species. It doesn’t find it, so throws an error.
  • When "Species" is quoted, R evaluates it for what it is: a string. So R doesn’t try to return a value from it.
  • dplyr::select creates an environment, which has a custom method of evaluation. This is why you can pass unquoted string there — R will not look computer the symbol looking for a value in the env.

In each dplyr::function(df, var), every var is evaluated in the environment of the function, which have special way of computed symbols. In the case of filter, R looks for a column named var in df (in practice, that’s not exactly how it works, but you get the point).

This explains the error being thrown earlier: group_by was looking for the col_group column inside our data.frame.

Getting started

Then, the big question: how can we program with dplyr? How can we pass the unquoted Species arg from the function gg_top to our group_by, and Sepal.Length to the ggplot? Let’s start by breaking our problem into two parts: the dplyr, then the ggplot.

So first, we need to create a function that takes a data.frame, makes a group_by on a column, then returns the slice(5:10). Basically something doing:

iris %>%
  group_by(Species) %>%
  slice(5:10)

# We could think of 
slicer <- function(df, var){
  df %>%
  group_by(var) %>%
  slice(5:10)
}
slicer(df = iris, var = Species)

> Error in grouped_df_impl(data, unname(vars), drop) : 
  Column `var` is unknown

Here, you can see that R is looking for a var column. That’s because var is evaluated in the environment created by group_by, so looking for the column var in the iris df. So how to prevent that?

We could think of:

slicer(df = iris, var = "Species")

> Error in grouped_df_impl(data, unname(vars), drop) : 
  Column `var` is unknown

But 1: that’s not working (because group_by doesn’t take a string), 2: we don’t want to quote.

So the thing is: dplyr functions work with a special type of objects, called quosure — this is how symbols are evaluated. You can create them with quo().

quo(Species)
<quosure: global>
~Species

# So is this going to work? 
slicer <- function(df, var){
  df %>%
  group_by(quo(var)) %>%
  slice(5:10)
}
slicer(df = iris, var = Species)
Error in mutate_impl(.data, dots) : 
  Column `quo(var)` is of unsupported type quoted call

Nop! Obviously here, group_by(quo(var)) compute quo(var) as a quosure, so it does:

quo(quo(var))
<quosure: frame>
~quo(var)

Not what we’ve been looking for either. We need a way to prevent the symbol var from being evaluated the standard way, but evaluated with tidy eval. Good news, there’s a function for that — enquo(). This function :

  • Takes a symbol
  • quotes the R code supplied
  • captures the environment
  • returns a quosure

Then, we need a way to tell group_by that we’ve taken care to the “quosurisation” (that’s not the real word, you know!). So… here comes !! (to be pronounced “Bang Bang” :) )

slicer <- function(df, var){
  enquo_var <- enquo(var)
  df %>%
    # !! tells dplyr not to compute the object as a quosure
  group_by(!!enquo_var) %>%
  slice(5:10)
}
# That works!
slicer(df = iris, var = Species)

[emoji party]

the ggplot part

So now, we need to pass the col_group and col_plot into the ggplot call. We may be tempted to pass !!enquo_col_plot the same way we passed it through group_by. Thing is: tidy eval is not yet implemented in ggplot2 — so you can’t pass the enquo(var) to it.

gg_top <- function(df, col_group, col_plot){
  enquo_col_group <- enquo(col_group)
  enquo_col_plot <- enquo(col_plot)
  df %>%
  group_by(!!enquo_col_group) %>%
    slice(5:10) %>% 
    ggplot(aes(!!enquo_col_group, !!enquo_col_plot)) + 
    geom_point()  
}

gg_top(df = iris, col_group = Species, col_plot = Sepal.Length)

Error in (function (x)  : could not find 'enquo_var' 

The trick is: you can use quo_name, which returns a character string with the name of the expression you’ve typed. Pass it to ggplot2::aes_string… and Voilà!

gg_top <- function(df, col_group, col_plot){
  enquo_col_group <- enquo(col_group)
  enquo_col_plot <- enquo(col_plot)  
  df %>%
  group_by(!!enquo_col_group) %>%
    slice(5:10) %>% 
    ggplot(aes_string(quo_name(enquo_col_group), quo_name(enquo_col_plot))) + 
    geom_point()  
}

gg_top(df = iris, col_group = Species, col_plot = Sepal.Length)

[emoji party]^2

Sorry, that was quite a long post.. I hope it has enlightened some dark side of the tidyverse :)

Coming soon: more on tidy eval, environment, and computing on the R language.

What do you think?