A Crazy Little Thing Called {purrr} - Part 6 : doing statistics
This is gonna be the last time I make this Queen reference in 2017.
I’ve been sharing some “stats with {purrr}” recipes on my Twitter account lately. Twitter being what it is (a source of infinite temporary content), here’s the post gathering some statistics tricks you can do with {purrr}.
Looking for normality
If your data are in a data.frame, you can simply map a shapiro.test on
all the columns, keeping the one with a p.value
> to 0.05 (remember,
the shapiro tests for non-normality) :
library(purrr)
map(airquality, shapiro.test) %>% keep(~ .x$p.value > 0.05)
## $Wind
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.98575, p-value = 0.1178
Of course, you can do the same on a non-tabular list:
# rdunif is from purrr
l <- list(a = rnorm(10), b = rdunif(1000, 10))
map(l, shapiro.test) %>% keep(~ .x$p.value > 0.05)
## $a
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.91275, p-value = 0.3004
Also, map_if
allows you to map only on numeric variables in your
data.frame:
map_if(iris, is.numeric, shapiro.test)
## $Sepal.Length
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.97609, p-value = 0.01018
##
##
## $Sepal.Width
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.98492, p-value = 0.1012
##
##
## $Petal.Length
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.87627, p-value = 7.412e-10
##
##
## $Petal.Width
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.90183, p-value = 1.68e-08
##
##
## $Species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
Bulk cor.test
As said on Twitter, this might be
p.hacking, but
here’s how you can do a cor.test
for all columns in a data.frame, with
a little help from tidystringdist
and broom
:
library(tidystringdist) # Works since v0.1.2
comb <- tidy_comb_all(names(airquality))
knitr::kable(comb)
V1 | V2 |
---|---|
Ozone | Solar.R |
Ozone | Wind |
Ozone | Temp |
Ozone | Month |
Ozone | Day |
Solar.R | Wind |
Solar.R | Temp |
Solar.R | Month |
Solar.R | Day |
Wind | Temp |
Wind | Month |
Wind | Day |
Temp | Month |
Temp | Day |
Month | Day |
bulk_cor <- pmap(comb, ~ cor.test(airquality[[.x]], airquality[[.y]])) %>%
map_df(broom::tidy) %>%
cbind(comb, .)
knitr::kable(bulk_cor, digits = 3)
V1 | V2 | estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
Ozone | Solar.R | 0.348 | 3.880 | 0.000 | 109 | 0.173 | 0.502 | Pearson’s product-moment correlation | two.sided |
Ozone | Wind | -0.602 | -8.040 | 0.000 | 114 | -0.706 | -0.471 | Pearson’s product-moment correlation | two.sided |
Ozone | Temp | 0.698 | 10.418 | 0.000 | 114 | 0.591 | 0.781 | Pearson’s product-moment correlation | two.sided |
Ozone | Month | 0.165 | 1.781 | 0.078 | 114 | -0.018 | 0.337 | Pearson’s product-moment correlation | two.sided |
Ozone | Day | -0.013 | -0.141 | 0.888 | 114 | -0.195 | 0.169 | Pearson’s product-moment correlation | two.sided |
Solar.R | Wind | -0.057 | -0.683 | 0.496 | 144 | -0.217 | 0.107 | Pearson’s product-moment correlation | two.sided |
Solar.R | Temp | 0.276 | 3.444 | 0.001 | 144 | 0.119 | 0.419 | Pearson’s product-moment correlation | two.sided |
Solar.R | Month | -0.075 | -0.906 | 0.366 | 144 | -0.235 | 0.088 | Pearson’s product-moment correlation | two.sided |
Solar.R | Day | -0.150 | -1.824 | 0.070 | 144 | -0.305 | 0.012 | Pearson’s product-moment correlation | two.sided |
Wind | Temp | -0.458 | -6.331 | 0.000 | 151 | -0.575 | -0.323 | Pearson’s product-moment correlation | two.sided |
Wind | Month | -0.178 | -2.227 | 0.027 | 151 | -0.328 | -0.020 | Pearson’s product-moment correlation | two.sided |
Wind | Day | 0.027 | 0.334 | 0.739 | 151 | -0.132 | 0.185 | Pearson’s product-moment correlation | two.sided |
Temp | Month | 0.421 | 5.703 | 0.000 | 151 | 0.281 | 0.543 | Pearson’s product-moment correlation | two.sided |
Temp | Day | -0.131 | -1.619 | 0.108 | 151 | -0.283 | 0.029 | Pearson’s product-moment correlation | two.sided |
Month | Day | -0.008 | -0.098 | 0.922 | 151 | -0.166 | 0.151 | Pearson’s product-moment correlation | two.sided |
Some Machine Learning
lm
Let’s build a linear model of all possible iris combinations:
res <- pmap(comb, ~ lm(airquality[[.x]] ~ airquality[[.y]]))
get_rsquared <- compose(as_mapper(~ .x$r.squared), summary)
map_dbl(res, get_rsquared)
## [1] 1.213419e-01 3.618582e-01 4.877072e-01 2.706660e-02 1.749177e-04
## [6] 3.225293e-03 7.608786e-02 5.670205e-03 2.258257e-02 2.097529e-01
## [11] 3.178824e-02 7.388015e-04 1.771966e-01 1.705458e-02 6.338966e-05
Any chance there’s a r.squared above 0.5 ?
map(res, get_rsquared) %>% some(~ .x > 0.5)
## [1] FALSE
rpart
We’ll build 20 rpart
to find the better model. Yes, this will be
better to do it with a random forest, but we’re here for the example :)
Training and validation
Let’s do the train / validation.
library(dplyr)
library(readr)
titanic <- read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv")
train <- rerun(20, sample_frac(titanic, size = 0.8))
validation <- map(train, ~ anti_join(titanic, .x))
Check if the sizes of all validation datasets are the same:
map_int(validation, nrow) %>% every(~ .x == 262)
## [1] TRUE
Create a model builder
We’ll create a simple model to predict survival based on sex.
library(rpart)
rpart_pimped <- partial(rpart, formula = survived ~ sex, method = "class")
res <- map(train, ~ rpart_pimped(data = .x))
Make prediction
prediction <- map2(validation, res, ~ predict(.y, .x, type = "class"))
w_prediction <- map2(validation, prediction, ~ mutate(.x, prediction = .y))
Confusion matrix
Now let’s map a conf matrix on all these results:
library(caret)
conf_mats <- map(w_prediction, ~ confusionMatrix(.x$prediction, .x$survived))
Is the “Sensivity” for all models above 0.8?
map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% every(~ .x > 0.8)
## [1] TRUE
Is the “Specificity” for all models above 0.8?
map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% every(~ .x > 0.8)
## [1] FALSE
keep_index
Let’s modify keep
a little bit so we can extract the models we need:
# Here's the keep source code
keep
## function(.x, .p, ...) {
## sel <- probe(.x, .p, ...)
## .x[!is.na(sel) & sel]
## }
## <environment: namespace:purrr>
# Let's tweak it a little bit
keep_index <- function(.x, .p, ...) {
sel <- purrr:::probe(.x, .p, ...)
which(sel)
}
So, which are the models with a sensitivity superior to 0.85?
map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% keep_index(~ .x > 0.85)
## [1] 1 3 4 6 11 13 14 15 16 18 19
And the models with a specificity superior to 0.7?
map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% keep_index(~ .x > 0.7)
## [1] 2 7 13 14 17
Which models are in both?
sens <- map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% keep_index(~ .x > 0.85)
spec <- map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% keep_index(~ .x > 0.7)
keep(sens, map_lgl(sens, ~ .x %in% spec))
## [1] 13 14
So, I guess we’ll go for model(s) number 13, 14!
What do you think?