A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

5 minute(s) read

This is gonna be the last time I make this Queen reference in 2017.

I’ve been sharing some “stats with {purrr}” recipes on my Twitter account lately. Twitter being what it is (a source of infinite temporary content), here’s the post gathering some statistics tricks you can do with {purrr}.

Looking for normality

If your data are in a data.frame, you can simply map a shapiro.test on all the columns, keeping the one with a p.value > to 0.05 (remember, the shapiro tests for non-normality) :

library(purrr)
map(airquality, shapiro.test) %>% keep(~ .x$p.value > 0.05)
## $Wind
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.98575, p-value = 0.1178

Of course, you can do the same on a non-tabular list:

# rdunif is from purrr
l <- list(a = rnorm(10), b = rdunif(1000, 10))
map(l, shapiro.test) %>% keep(~ .x$p.value > 0.05)
## $a
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.91275, p-value = 0.3004

Also, map_if allows you to map only on numeric variables in your data.frame:

map_if(iris, is.numeric, shapiro.test)
## $Sepal.Length
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.97609, p-value = 0.01018
## 
## 
## $Sepal.Width
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.98492, p-value = 0.1012
## 
## 
## $Petal.Length
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.87627, p-value = 7.412e-10
## 
## 
## $Petal.Width
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.90183, p-value = 1.68e-08
## 
## 
## $Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

Bulk cor.test

As said on Twitter, this might be p.hacking, but here’s how you can do a cor.test for all columns in a data.frame, with a little help from tidystringdist and broom :

library(tidystringdist) # Works since v0.1.2 
comb <- tidy_comb_all(names(airquality))
knitr::kable(comb)
V1 V2
Ozone Solar.R
Ozone Wind
Ozone Temp
Ozone Month
Ozone Day
Solar.R Wind
Solar.R Temp
Solar.R Month
Solar.R Day
Wind Temp
Wind Month
Wind Day
Temp Month
Temp Day
Month Day
bulk_cor <- pmap(comb, ~ cor.test(airquality[[.x]], airquality[[.y]])) %>% 
  map_df(broom::tidy) %>% 
  cbind(comb, .)

knitr::kable(bulk_cor, digits = 3)
V1 V2 estimate statistic p.value parameter conf.low conf.high method alternative
Ozone Solar.R 0.348 3.880 0.000 109 0.173 0.502 Pearson’s product-moment correlation two.sided
Ozone Wind -0.602 -8.040 0.000 114 -0.706 -0.471 Pearson’s product-moment correlation two.sided
Ozone Temp 0.698 10.418 0.000 114 0.591 0.781 Pearson’s product-moment correlation two.sided
Ozone Month 0.165 1.781 0.078 114 -0.018 0.337 Pearson’s product-moment correlation two.sided
Ozone Day -0.013 -0.141 0.888 114 -0.195 0.169 Pearson’s product-moment correlation two.sided
Solar.R Wind -0.057 -0.683 0.496 144 -0.217 0.107 Pearson’s product-moment correlation two.sided
Solar.R Temp 0.276 3.444 0.001 144 0.119 0.419 Pearson’s product-moment correlation two.sided
Solar.R Month -0.075 -0.906 0.366 144 -0.235 0.088 Pearson’s product-moment correlation two.sided
Solar.R Day -0.150 -1.824 0.070 144 -0.305 0.012 Pearson’s product-moment correlation two.sided
Wind Temp -0.458 -6.331 0.000 151 -0.575 -0.323 Pearson’s product-moment correlation two.sided
Wind Month -0.178 -2.227 0.027 151 -0.328 -0.020 Pearson’s product-moment correlation two.sided
Wind Day 0.027 0.334 0.739 151 -0.132 0.185 Pearson’s product-moment correlation two.sided
Temp Month 0.421 5.703 0.000 151 0.281 0.543 Pearson’s product-moment correlation two.sided
Temp Day -0.131 -1.619 0.108 151 -0.283 0.029 Pearson’s product-moment correlation two.sided
Month Day -0.008 -0.098 0.922 151 -0.166 0.151 Pearson’s product-moment correlation two.sided

Some Machine Learning

lm

Let’s build a linear model of all possible iris combinations:

res <- pmap(comb, ~ lm(airquality[[.x]] ~ airquality[[.y]]))
get_rsquared <- compose(as_mapper(~ .x$r.squared), summary)
map_dbl(res, get_rsquared)
##  [1] 1.213419e-01 3.618582e-01 4.877072e-01 2.706660e-02 1.749177e-04
##  [6] 3.225293e-03 7.608786e-02 5.670205e-03 2.258257e-02 2.097529e-01
## [11] 3.178824e-02 7.388015e-04 1.771966e-01 1.705458e-02 6.338966e-05

Any chance there’s a r.squared above 0.5 ?

map(res, get_rsquared) %>% some(~ .x > 0.5)
## [1] FALSE

rpart

We’ll build 20 rpart to find the better model. Yes, this will be better to do it with a random forest, but we’re here for the example :)

Training and validation

Let’s do the train / validation.

library(dplyr)
library(readr)
titanic <- read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv")
train <- rerun(20, sample_frac(titanic, size = 0.8))
validation <- map(train, ~ anti_join(titanic, .x))

Check if the sizes of all validation datasets are the same:

map_int(validation, nrow) %>% every(~ .x == 262)
## [1] TRUE

Create a model builder

We’ll create a simple model to predict survival based on sex.

library(rpart)
rpart_pimped <- partial(rpart, formula = survived ~ sex, method = "class")
res <- map(train, ~ rpart_pimped(data = .x))

Make prediction

prediction <- map2(validation, res, ~ predict(.y, .x, type = "class"))
w_prediction <- map2(validation, prediction, ~ mutate(.x, prediction = .y))

Confusion matrix

Now let’s map a conf matrix on all these results:

library(caret)
conf_mats <- map(w_prediction, ~ confusionMatrix(.x$prediction, .x$survived))

Is the “Sensivity” for all models above 0.8?

map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% every(~ .x > 0.8)
## [1] TRUE

Is the “Specificity” for all models above 0.8?

map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% every(~ .x > 0.8)
## [1] FALSE

keep_index

Let’s modify keep a little bit so we can extract the models we need:

# Here's the keep source code
keep
## function(.x, .p, ...) {
##   sel <- probe(.x, .p, ...)
##   .x[!is.na(sel) & sel]
## }
## <environment: namespace:purrr>
# Let's tweak it a little bit
keep_index <- function(.x, .p, ...) {
  sel <- purrr:::probe(.x, .p, ...)
  which(sel)
}

So, which are the models with a sensitivity superior to 0.85?

map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% keep_index(~ .x > 0.85)
##  [1]  1  3  4  6 11 13 14 15 16 18 19

And the models with a specificity superior to 0.7?

map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% keep_index(~ .x > 0.7)
## [1]  2  7 13 14 17

Which models are in both?

sens <- map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% keep_index(~ .x > 0.85)
spec <- map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% keep_index(~ .x > 0.7)
keep(sens, map_lgl(sens, ~ .x %in% spec))
## [1] 13 14

So, I guess we’ll go for model(s) number 13, 14!

What do you think?