Down the rabbit hole with tidy eval — Part 1
Some random explanations about programming with tidy eval.
What on earth is evaluation?
So, let’s start with a simple question: what is evaluation? Evaluation is the process of analyzing an expression in order to give the user something back. For example, in R, standard evaluation goes like this:
- you type/send something to the console (called an expression)
- press enter
- R does some magic stuff
- R returns you the value associated with the expression
For example :
# You type 1, the expression
1
# R evaluates 1, and returns you
[1] 1
a <- 1
# Here, the expression is a (a is the symbol)
# Standard eval: when a symbol is evaluated, it returns its value
a
[1] 1
Pretty clear isn’t it?
Spoiler: the part about R doing magic stuff wasn’t quite true. In fact, R takes the symbol you’ve entered (here a), turns it into an internal representation, then looks in the direct environment of the expression in order to return the value associated with it. If R doesn’t find the value in the environment the expression is linked to, it goes up to the parent env, then to that env’s parent, and so on and so forth.
This is R standard evaluation. The returned object is the value the symbol is linked to. Keep this in mind, you’ll need this later.
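To make the "goes up to the parent env" part concrete, here is a small base R sketch. The names parent_env, child_env and lookup_result are made up for this example; the point is only that a symbol missing from one environment is looked up in its parent.

```r
# A parent environment holding a value for the symbol `a`
parent_env <- new.env()
assign("a", 1, envir = parent_env)

# A child environment that does not define `a` itself
child_env <- new.env(parent = parent_env)

# `a` is not defined directly in the child...
exists("a", envir = child_env, inherits = FALSE)
# [1] FALSE

# ...but evaluating the symbol there still works: R walks up to the parent
lookup_result <- eval(quote(a), envir = child_env)
lookup_result
# [1] 1
```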
Aside: about lazy evaluation
One of R’s strengths is lazy evaluation. These strange words mean that R only evaluates a function argument if that argument is actually used. That’s why this kind of function works:
lazy <- function(a, b){
print("please take a nap")
}
lazy()
[1] "please take a nap"
lazy <- function(a, b){
print(a)
}
lazy("please take a nap")
[1] "please take a nap"
Here in function 1, a and b are never used inside the function, so they are never evaluated, and no error is thrown. In function 2, b is never called, so it’s not evaluated, and no error is thrown either. On the other hand, this doesn’t work:
lazy <- function(a, b){
print(a)
print(b)
}
lazy("please take a nap")
[1] "please take a nap"
Error in print(b) :
argument "b" is missing, with no default
Here, you can see that it throws an error: b is needed. You can also notice that a is evaluated first, the string is printed, and only then does the missing b throw an error.
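Another way to watch laziness at work is to pass an argument that makes some noise when it gets evaluated. This is only a sketch: noisy, lazy_unused and lazy_used are hypothetical helpers written for this example.

```r
# Hypothetical helper: announces the moment it is actually run
noisy <- function(){
  message("evaluating the argument now")
  "please take a nap"
}

# This function never touches its argument
lazy_unused <- function(a){
  "a was never touched"
}

# This function uses its argument
lazy_used <- function(a){
  a
}

# No message here: noisy() is never run, because a is never used
unused_result <- lazy_unused(noisy())

# The message appears here: a is evaluated when the function needs it
used_result <- lazy_used(noisy())
```

The first call stays silent, the second one prints the message, which is the same behaviour as the missing b above: nothing happens to an argument until the function body reaches it.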
About scoping
Quick thing to keep in mind here: the notion of environment. Each expression is by default evaluated in its own environment. If a symbol can’t be found there, R goes up to the parent env, then to that env’s parent, etc.
Each function defines its own environment, which can have its own rules (so basically its own rules for evaluating a symbol). This env is created when the function is called and is gone once the function has finished. That’s why you can’t directly access an object created inside a function:
create <- function(){
a <- 1
}
create()
a
Error: object 'a' not found
# The <<- operator overrides this
create <- function(){
a <<- 1
}
create()
a
[1] 1
# But please DON'T do that.
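If the goal is simply to get the value out of the function, the usual route is to return it and assign it at the call site. A minimal sketch (created_value is just a name picked for this example):

```r
create <- function(){
  a <- 1
  # The last expression of a function is its return value
  a
}

# The assignment happens where the function is called, not inside it
created_value <- create()
created_value
# [1] 1
```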
Let’s focus: what about tidy eval?
So, back to our original point. I’ve been diving into tidy eval lately as I’ve been contributing to {narnia}, a package designed to analyse missing data, the tidy way. The whole philosophy of the package being the tidyverse, I needed to contribute with the same philosophy in mind.
So basically, I needed to create a function that took a df and the unquoted name x of a column, called dplyr::group_by with this column, and then ggplot2::ggplot with aes(x), x being the name of the column previously specified. Thing is, you can’t simply do:
# Note : this is obviously not the function I was working on. This is an example.
#
# So you want to turn this into a function :
library(tidyverse)
iris %>%
group_by(Species) %>%
slice(5:10) %>%
ggplot(aes(Species, Sepal.Length)) +
geom_point()
# Let's try the simple way
gg_top <- function(df, col_group, col_plot){
df %>%
group_by(col_group) %>%
slice(5:10) %>%
ggplot(aes(col_group, col_plot)) +
geom_point()
}
gg_top(df = iris, col_group = Species, col_plot = Sepal.Length)
Error in grouped_df_impl(data, unname(vars), drop) :
Column `col_group` is unknown
OK. Here R simply can’t find col_group. But where is this coming from? I did specify that col_group was equal to Species. Why is it looking for col_group?
Let’s try something else.
# This works
select(iris, Species)
# So what if I want to reproduce it?
# I can think of
select_custom <- function(df, col){
df[, col]
}
select_custom(df = iris, col = Species)
Error in `[.data.frame`(df, , col) : object 'Species' not found
# But this works:
select_custom(df = iris, col = "Species")
God damn, how is it that dplyr::select works with an unquoted element, while select_custom needs a quoted string? That’s because:
- select_custom uses standard evaluation: R sees the symbol Species, and tries to evaluate it the standard way, i.e. by looking for the value of Species in the environment the call came from (and its parents). It doesn’t find it, so it throws an error.
- When "Species" is quoted, R evaluates it for what it is: a string. So R doesn’t try to return a value from it.
- dplyr::select creates an environment which has a custom method of evaluation. This is why you can pass an unquoted name there: R will not evaluate the symbol by looking for a value in the env.
In each dplyr::function(df, var), every var is evaluated in the environment of the function, which has a special way of computing symbols. In the case of filter, R looks for a column named var in df (in practice, that’s not exactly how it works, but you get the point).
This explains the error being thrown earlier: group_by was looking for the col_group column inside our data.frame.
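To get a feel for what "a custom method of evaluation" can look like, here is a rough base R sketch of a select_custom that accepts an unquoted name. This is not how dplyr does it internally, and select_nse is a hypothetical name; it only shows the general idea of capturing the code that was typed instead of evaluating it.

```r
select_nse <- function(df, col){
  # substitute() captures the expression typed for `col` without evaluating it
  col_expr <- substitute(col)
  # deparse() turns the captured symbol into a string, e.g. "Species"
  col_name <- deparse(col_expr)
  # From here on, this is plain standard evaluation with a string
  df[, col_name, drop = FALSE]
}

selected <- select_nse(df = iris, col = Species)
names(selected)
# [1] "Species"
```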
Getting started
Then, the big question: how can we program with dplyr? How can we pass the unquoted Species arg from the function gg_top to our group_by, and Sepal.Length to the ggplot? Let’s start by breaking our problem into two parts: the dplyr part, then the ggplot part.
So first, we need to create a function that takes a data.frame, makes a group_by on a column, then returns the slice(5:10). Basically something doing:
iris %>%
group_by(Species) %>%
slice(5:10)
# We could think of
slicer <- function(df, var){
df %>%
group_by(var) %>%
slice(5:10)
}
slicer(df = iris, var = Species)
> Error in grouped_df_impl(data, unname(vars), drop) :
Column `var` is unknown
Here, you can see that R is looking for a var column. That’s because var is evaluated in the environment created by group_by, which looks for a column named var in the iris df. So how do we prevent that?
We could think of:
slicer(df = iris, var = "Species")
> Error in grouped_df_impl(data, unname(vars), drop) :
Column `var` is unknown
But 1: that’s not working (group_by never looks at the value of var, and it doesn’t take a string anyway), 2: we don’t want to quote.
So the thing is: dplyr functions work with a special type of object, called a quosure: an expression bundled with the environment it should be evaluated in. You can create them with quo().
quo(Species)
<quosure: global>
~Species
# So is this going to work?
slicer <- function(df, var){
df %>%
group_by(quo(var)) %>%
slice(5:10)
}
slicer(df = iris, var = Species)
Error in mutate_impl(.data, dots) :
Column `quo(var)` is of unsupported type quoted call
Nope! Obviously here, group_by(quo(var)) computes quo(var) as a quosure, so it does:
quo(quo(var))
<quosure: frame>
~quo(var)
Not what we’ve been looking for either. We need a way to prevent the symbol var from being evaluated the standard way, and have it evaluated with tidy eval instead. Good news, there’s a function for that: enquo(). This function:
- takes a symbol referring to a function argument
- quotes the R code supplied to that argument
- captures the environment where the function was called
- returns a quosure
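The steps above can be checked by hand with a couple of rlang accessors. A small sketch, assuming {rlang} is installed; capture is a hypothetical helper that does nothing but call enquo():

```r
library(rlang)

# Hypothetical helper: captures its argument and returns the quosure
capture <- function(var){
  enquo(var)
}

captured <- capture(Species)

# It is a quosure...
is_quosure(captured)
# [1] TRUE

# ...holding the code that was typed, still unevaluated
quo_get_expr(captured)
# Species

# ...along with the environment the function was called from
quo_get_env(captured)
```

Note that Species doesn’t need to exist anywhere for this to run: nothing has been evaluated yet.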
Then, we need a way to tell group_by that we’ve taken care of the “quosurisation” (that’s not the real word, you know!). So… here comes !! (to be pronounced “Bang Bang” :) )
slicer <- function(df, var){
enquo_var <- enquo(var)
df %>%
# !! unquotes enquo_var, so group_by sees Species rather than the name enquo_var
group_by(!!enquo_var) %>%
slice(5:10)
}
# That works!
slicer(df = iris, var = Species)
[emoji party]
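If you’re curious about what !! actually changes, you can look at the call without running it. In this sketch, rlang::expr() is used only to build the call (group_by is never executed, so dplyr isn’t even needed), and without_bang / with_bang are names made up for the example.

```r
library(rlang)

enquo_var <- quo(Species)

# Without !!, the call still contains the name enquo_var
without_bang <- expr(group_by(enquo_var))
without_bang
# group_by(enquo_var)

# With !!, the content of enquo_var is inserted into the call
with_bang <- expr(group_by(!!enquo_var))

# The argument of the call is now the quosure itself
is_quosure(with_bang[[2]])
# [1] TRUE
```

So the first version is what produced the "Column `var` is unknown" kind of error, while the second one hands group_by the Species quosure.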
The ggplot part
So now, we need to pass col_group and col_plot into the ggplot call. We may be tempted to pass !!enquo_col_plot the same way we passed it through group_by. Thing is: at the time of writing, tidy eval is not yet implemented in ggplot2, so you can’t pass the enquo(var) to it.
gg_top <- function(df, col_group, col_plot){
enquo_col_group <- enquo(col_group)
enquo_col_plot <- enquo(col_plot)
df %>%
group_by(!!enquo_col_group) %>%
slice(5:10) %>%
ggplot(aes(!!enquo_col_group, !!enquo_col_plot)) +
geom_point()
}
gg_top(df = iris, col_group = Species, col_plot = Sepal.Length)
Error in (function (x) : could not find 'enquo_var'
The trick is: you can use quo_name, which returns a character string with the name of the expression you’ve typed. Pass it to ggplot2::aes_string… and voilà!
gg_top <- function(df, col_group, col_plot){
enquo_col_group <- enquo(col_group)
enquo_col_plot <- enquo(col_plot)
df %>%
group_by(!!enquo_col_group) %>%
slice(5:10) %>%
ggplot(aes_string(quo_name(enquo_col_group), quo_name(enquo_col_plot))) +
geom_point()
}
gg_top(df = iris, col_group = Species, col_plot = Sepal.Length)
[emoji party]^2
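To see what quo_name hands over to aes_string in the function above, here is a quick check, assuming {rlang} is installed. name_of is a hypothetical helper written for this example.

```r
library(rlang)

# Hypothetical helper: captures the argument, then returns its name as a string
name_of <- function(col){
  quo_name(enquo(col))
}

name_of(Sepal.Length)
# [1] "Sepal.Length"
```

That plain string is exactly the kind of input aes_string expects, which is why the combination works.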
Sorry, that was quite a long post. I hope it has shed some light on a dark side of the tidyverse :)
Coming soon: more on tidy eval, environment, and computing on the R language.
What do you think?