Chapter 2 Statistics

set.seed(2811)

2.1 Compute the mean

2.1.1 I want to…

Get the mean of vectors of unequal length.

2.1.2 Here’s how to:

numbers <- list(rnorm(10), rnorm(10), rnorm(1000))
trim <- 20
na_rm <- TRUE
pmap_dbl(list(numbers, trim, na_rm), ~ mean(..1, ..2,..3))

## [1]  0.11880200 -0.20132631  0.01946699

2.1.3 Ok, but why?

pmap takes a list of list as an input, and send them to the function. In .f, you can refer to the list arguments with their position: here, ..1, ..2 and ..3.

2.1.4 See also

2.2 Running a shapiro test

Given the dataset airquality.

2.2.1 I want to…

Look for normality on all columns, and know the one which are normal:

2.2.2 Here’s how to:

map(airquality, shapiro.test) %>% keep(~ .x$p.value > 0.05)

## $Wind
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.98575, p-value = 0.1178

2.2.3 Ok, but why?

In R, data.frame are lists of vectors of same length. So, you can apply a function the same way you would apply a function on any list. Here, we are mapping a shapiro.test, on all columns, and we keep only the elements with a .x$p.value which is more than 0.05.

2.2.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.3 Test only numeric columns

2.3.1 I want to…

Make sure I make my statistical test on numeric values.

2.3.2 Here’s how to:

map_if(.x = iris, .p = is.numeric, .f = shapiro.test)

## $Sepal.Length
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.97609, p-value = 0.01018
## 
## 
## $Sepal.Width
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.98492, p-value = 0.1012
## 
## 
## $Petal.Length
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.87627, p-value = 7.412e-10
## 
## 
## $Petal.Width
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.90183, p-value = 1.68e-08
## 
## 
## $Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

2.3.3 Ok, but why?

map_if runs .f only if the .x verifies the condition .p.

2.3.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.4 cor.test

2.4.1 I want to…

Make a bulk cor.test of all my variables.

2.4.2 Here’s how to:

library(tidystringdist) # Works since v0.1.2 
comb <- tidy_comb_all(names(airquality))
pmap(comb, ~ cor.test(airquality[[.x]], airquality[[.y]])) %>% 
  map_df(broom::tidy) %>% 
  cbind(comb, .) %>%
  select(V1:parameter)

V1	V2	estimate	statistic	p.value	parameter
Ozone	Solar.R	0.3483417	3.8797948	0.0001793	109
Ozone	Wind	-0.6015465	-8.0401299	0.0000000	114
Ozone	Temp	0.6983603	10.4177242	0.0000000	114
Ozone	Month	0.1645193	1.7808517	0.0776001	114
Ozone	Day	-0.0132256	-0.1412236	0.8879425	114
Solar.R	Wind	-0.0567917	-0.6826017	0.4959552	144
Solar.R	Temp	0.2758403	3.4436863	0.0007518	144
Solar.R	Month	-0.0753008	-0.9061819	0.3663534	144
Solar.R	Day	-0.1502750	-1.8240128	0.0702234	144
Wind	Temp	-0.4579879	-6.3308351	0.0000000	151
Wind	Month	-0.1782926	-2.2265711	0.0274562	151
Wind	Day	0.0271809	0.3341280	0.7387466	151
Temp	Month	0.4209473	5.7025370	0.0000001	151
Temp	Day	-0.1305932	-1.6186176	0.1076164	151
Month	Day	-0.0079618	-0.0978389	0.9221900	151

2.4.3 Ok, but why?

comb is a table containing all combinations of the names of the columns. What we do is mapping a cor.test on all these combinations by extracting, each time, the column as a vector, with airquality[[.x]] and airquality[[.y]].

pmap allows to use a list as a signe input.

2.4.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.5 Linear regression

2.5.1 I want to…

Get the r.squared of each of the possible lm of airquality combinations.

2.5.2 Here’s how to:

res <- pmap(comb, ~ lm(airquality[[.x]] ~ airquality[[.y]]))
get_rsquared <- compose(as_mapper(~ .x$r.squared), summary)
map_dbl(res, get_rsquared)

##  [1] 1.213419e-01 3.618582e-01 4.877072e-01 2.706660e-02 1.749177e-04
##  [6] 3.225293e-03 7.608786e-02 5.670205e-03 2.258257e-02 2.097529e-01
## [11] 3.178824e-02 7.388015e-04 1.771966e-01 1.705458e-02 6.338966e-05

2.5.3 Ok, but why?

We’re building a model of all combinations with pmap, just as before with cor.test.

Then, the get_rsquared function is a composition of extracting the r.squared of the summary of a lm result. compose(x, y) allows to build x(y()). Here, we are combining a mapper extracting the r.squared element out of the summary() of a lm.

2.5.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.6 significant r.squared

2.6.1 I want to…

Know if some r.square are above O.5 :

2.6.2 Here’s how to:

res <- pmap(comb, ~ lm(airquality[[.x]] ~ airquality[[.y]]))
get_rsquared <- compose(as_mapper(~ .x$r.squared), summary)
map_dbl(res, get_rsquared) %>% some(~ .x > 0.5)

## [1] FALSE

2.6.3 Ok, but why?

some checks if any of the input validate the condition.

2.6.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.7 test and validation

2.7.1 I want to…

Create 20 test and validation datasets.

2.7.2 Here’s how to:

# From http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv
titanic <- readr::read_csv("titanic.csv")
train <- rerun(20, sample_frac(titanic, size = 0.8))
validation <- map(train, ~ anti_join(titanic, .x))

2.7.3 Ok, but why?

rerun runs the sampling 20 times. To obtain the 20 validation sets, we anti-join each elements of the train list with the original dataframe. That way, train[1] + validation[1] = titanic, train[2] + validation[2] = titanic, etc

2.7.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.8 rpart

2.8.1 I want to…

Create 20 rpart, modeled on my 20 elements in the test list.

2.8.2 Here’s how to:

library(rpart)
rpart_pimped <- partial(rpart, formula = survived ~ sex, method = "class")
res <- map(train, rpart_pimped)
res[[1]]

## n= 1047 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1047 408 0 (0.6103152 0.3896848)  
##   2) sex=male 666 127 0 (0.8093093 0.1906907) *
##   3) sex=female 381 100 1 (0.2624672 0.7375328) *

2.8.3 Ok, but why?

partial allows to build a prefil function, which is then mapped on each element.

2.8.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.9 Make prediction

2.9.1 I want to…

Make prediction based on my models

2.9.2 Here’s how to:

prediction <- map2(validation, res, ~ predict(.y, .x, type = "class"))
w_prediction <- map2(validation, prediction, ~ mutate(.x, prediction = .y))

2.9.3 Ok, but why?

map2 allows to map on two arguments.

2.9.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.10 Confusion matrix

2.10.1 I want to…

Create a conf matrix on all these results:

2.10.2 Here’s how to:

library(caret)
conf_mats <- map(w_prediction, ~ confusionMatrix(.x$prediction, .x$survived))

2.10.3 Ok, but why?

You can use .x as many times as you want in .f.

2.10.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics

2.11 Sensitivity and Specificity

2.11.1 I want to…

Detect which models have a specificity above 0.7 and sensitivity above 0.85 (randomly chosen numbers).

2.11.2 Here’s how to:

keep_index <- function(.x, .p, ...) {
  sel <- purrr:::probe(.x, .p, ...)
  which(sel)
}
sens <- map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% keep_index(~ .x > 0.85)
spec <- map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% keep_index(~ .x > 0.7)
keep(sens, map_lgl(sens, ~ .x %in% spec))

## [1] 2 5

2.11.3 Ok, but why?

We have created a function that returns the position of elements that validates a condition. sens is the vector containing the position with sensitivity above 0.85, spec the vector for specificity above 0.7.

Then, we pass to keep a vector of logical built with map_lgl. This vector tells if each elements of sens is or isn’t in spec.

2.11.4 See also

A Crazy Little Thing Called {purrr} - Part 6 : doing statistics