Chapter 2 Statistics
2.1 Compute the mean
2.1.1 I want to…
Get the mean of vectors of unequal length.
2.1.2 Here’s how to:
numbers <- list(rnorm(10), rnorm(10), rnorm(1000))
trim <- 20
na_rm <- TRUE
pmap_dbl(list(numbers, trim, na_rm), ~ mean(..1, ..2,..3))
## [1] 0.11880200 -0.20132631 0.01946699
2.1.3 Ok, but why?
pmap
takes a list of list as an input, and send them to the function. In .f
, you can refer to the list arguments with their position: here, ..1
, ..2
and ..3
.
2.1.4 See also
2.2 Running a shapiro test
Given the dataset airquality
.
2.2.1 I want to…
Look for normality on all columns, and know the one which are normal:
2.2.2 Here’s how to:
## $Wind
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.98575, p-value = 0.1178
2.2.3 Ok, but why?
In R, data.frame are lists of vectors of same length. So, you can apply a function the same way you would apply a function on any list. Here, we are mapping a shapiro.test
, on all columns, and we keep
only the elements with a .x$p.value
which is more than 0.05.
2.3 Test only numeric columns
2.3.1 I want to…
Make sure I make my statistical test on numeric values.
2.3.2 Here’s how to:
## $Sepal.Length
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.97609, p-value = 0.01018
##
##
## $Sepal.Width
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.98492, p-value = 0.1012
##
##
## $Petal.Length
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.87627, p-value = 7.412e-10
##
##
## $Petal.Width
##
## Shapiro-Wilk normality test
##
## data: .x[[i]]
## W = 0.90183, p-value = 1.68e-08
##
##
## $Species
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
2.3.3 Ok, but why?
map_if
runs .f
only if the .x
verifies the condition .p
.
2.4 cor.test
2.4.1 I want to…
Make a bulk cor.test of all my variables.
2.4.2 Here’s how to:
library(tidystringdist) # Works since v0.1.2
comb <- tidy_comb_all(names(airquality))
pmap(comb, ~ cor.test(airquality[[.x]], airquality[[.y]])) %>%
map_df(broom::tidy) %>%
cbind(comb, .) %>%
select(V1:parameter)
V1 | V2 | estimate | statistic | p.value | parameter |
---|---|---|---|---|---|
Ozone | Solar.R | 0.3483417 | 3.8797948 | 0.0001793 | 109 |
Ozone | Wind | -0.6015465 | -8.0401299 | 0.0000000 | 114 |
Ozone | Temp | 0.6983603 | 10.4177242 | 0.0000000 | 114 |
Ozone | Month | 0.1645193 | 1.7808517 | 0.0776001 | 114 |
Ozone | Day | -0.0132256 | -0.1412236 | 0.8879425 | 114 |
Solar.R | Wind | -0.0567917 | -0.6826017 | 0.4959552 | 144 |
Solar.R | Temp | 0.2758403 | 3.4436863 | 0.0007518 | 144 |
Solar.R | Month | -0.0753008 | -0.9061819 | 0.3663534 | 144 |
Solar.R | Day | -0.1502750 | -1.8240128 | 0.0702234 | 144 |
Wind | Temp | -0.4579879 | -6.3308351 | 0.0000000 | 151 |
Wind | Month | -0.1782926 | -2.2265711 | 0.0274562 | 151 |
Wind | Day | 0.0271809 | 0.3341280 | 0.7387466 | 151 |
Temp | Month | 0.4209473 | 5.7025370 | 0.0000001 | 151 |
Temp | Day | -0.1305932 | -1.6186176 | 0.1076164 | 151 |
Month | Day | -0.0079618 | -0.0978389 | 0.9221900 | 151 |
2.4.3 Ok, but why?
comb
is a table containing all combinations of the names of the columns. What we do is mapping a cor.test
on all these combinations by extracting, each time, the column as a vector, with airquality[[.x]]
and airquality[[.y]]
.
pmap
allows to use a list as a signe input.
2.5 Linear regression
2.5.1 I want to…
Get the r.squared of each of the possible lm of airquality combinations.
2.5.2 Here’s how to:
res <- pmap(comb, ~ lm(airquality[[.x]] ~ airquality[[.y]]))
get_rsquared <- compose(as_mapper(~ .x$r.squared), summary)
map_dbl(res, get_rsquared)
## [1] 1.213419e-01 3.618582e-01 4.877072e-01 2.706660e-02 1.749177e-04
## [6] 3.225293e-03 7.608786e-02 5.670205e-03 2.258257e-02 2.097529e-01
## [11] 3.178824e-02 7.388015e-04 1.771966e-01 1.705458e-02 6.338966e-05
2.5.3 Ok, but why?
We’re building a model of all combinations with pmap
, just as before with cor.test
.
Then, the get_rsquared
function is a composition of extracting the r.squared of the summary of a lm result. compose(x, y)
allows to build x(y())
. Here, we are combining a mapper extracting the r.squared
element out of the summary()
of a lm
.
2.6 significant r.squared
2.6.1 I want to…
Know if some r.square are above O.5 :
2.6.2 Here’s how to:
res <- pmap(comb, ~ lm(airquality[[.x]] ~ airquality[[.y]]))
get_rsquared <- compose(as_mapper(~ .x$r.squared), summary)
map_dbl(res, get_rsquared) %>% some(~ .x > 0.5)
## [1] FALSE
2.6.3 Ok, but why?
some
checks if any of the input validate the condition.
2.7 test and validation
2.7.1 I want to…
Create 20 test and validation datasets.
2.7.2 Here’s how to:
2.7.3 Ok, but why?
rerun
runs the sampling 20 times. To obtain the 20 validation sets, we anti-join each elements of the train list with the original dataframe. That way, train[1]
+ validation[1]
= titanic
, train[2]
+ validation[2]
= titanic
, etc
2.8 rpart
2.8.1 I want to…
Create 20 rpart
, modeled on my 20 elements in the test
list.
2.8.2 Here’s how to:
library(rpart)
rpart_pimped <- partial(rpart, formula = survived ~ sex, method = "class")
res <- map(train, rpart_pimped)
res[[1]]
## n= 1047
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1047 408 0 (0.6103152 0.3896848)
## 2) sex=male 666 127 0 (0.8093093 0.1906907) *
## 3) sex=female 381 100 1 (0.2624672 0.7375328) *
2.8.3 Ok, but why?
partial
allows to build a prefil function, which is then mapped on each element.
2.9 Make prediction
2.9.1 I want to…
Make prediction based on my models
2.9.2 Here’s how to:
2.9.3 Ok, but why?
map2
allows to map on two arguments.
2.10 Confusion matrix
2.10.1 I want to…
Create a conf matrix on all these results:
2.10.2 Here’s how to:
2.10.3 Ok, but why?
You can use .x
as many times as you want in .f
.
2.11 Sensitivity and Specificity
2.11.1 I want to…
Detect which models have a specificity above 0.7 and sensitivity above 0.85 (randomly chosen numbers).
2.11.2 Here’s how to:
keep_index <- function(.x, .p, ...) {
sel <- purrr:::probe(.x, .p, ...)
which(sel)
}
sens <- map_dbl(conf_mats, ~ .x$byClass["Sensitivity"]) %>% keep_index(~ .x > 0.85)
spec <- map_dbl(conf_mats, ~ .x$byClass["Specificity"]) %>% keep_index(~ .x > 0.7)
keep(sens, map_lgl(sens, ~ .x %in% spec))
## [1] 2 5
2.11.3 Ok, but why?
We have created a function that returns the position of elements that validates a condition. sens
is the vector containing the position with sensitivity above 0.85, spec
the vector for specificity above 0.7.
Then, we pass to keep
a vector of logical built with map_lgl
. This vector tells if each elements of sens
is or isn’t in spec
.