Chapter 2 Statistics

2.1 Compute the mean

2.1.1 I want to…

Get the mean of vectors of unequal length.

2.1.3 Ok, but why?

pmap takes a list of list as an input, and send them to the function. In .f, you can refer to the list arguments with their position: here, ..1, ..2 and ..3.

2.1.4 See also


2.2 Running a shapiro test

Given the dataset airquality.

2.2.1 I want to…

Look for normality on all columns, and know the one which are normal:

2.2.2 Here’s how to:

## $Wind
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.98575, p-value = 0.1178

2.2.3 Ok, but why?

In R, data.frame are lists of vectors of same length. So, you can apply a function the same way you would apply a function on any list. Here, we are mapping a shapiro.test, on all columns, and we keep only the elements with a .x$p.value which is more than 0.05.

2.3 Test only numeric columns

2.3.1 I want to…

Make sure I make my statistical test on numeric values.

2.3.2 Here’s how to:

## $Sepal.Length
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.97609, p-value = 0.01018
## 
## 
## $Sepal.Width
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.98492, p-value = 0.1012
## 
## 
## $Petal.Length
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.87627, p-value = 7.412e-10
## 
## 
## $Petal.Width
## 
##  Shapiro-Wilk normality test
## 
## data:  .x[[i]]
## W = 0.90183, p-value = 1.68e-08
## 
## 
## $Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

2.3.3 Ok, but why?

map_if runs .f only if the .x verifies the condition .p.

2.4 cor.test

2.4.1 I want to…

Make a bulk cor.test of all my variables.

2.4.2 Here’s how to:

V1 V2 estimate statistic p.value parameter
Ozone Solar.R 0.3483417 3.8797948 0.0001793 109
Ozone Wind -0.6015465 -8.0401299 0.0000000 114
Ozone Temp 0.6983603 10.4177242 0.0000000 114
Ozone Month 0.1645193 1.7808517 0.0776001 114
Ozone Day -0.0132256 -0.1412236 0.8879425 114
Solar.R Wind -0.0567917 -0.6826017 0.4959552 144
Solar.R Temp 0.2758403 3.4436863 0.0007518 144
Solar.R Month -0.0753008 -0.9061819 0.3663534 144
Solar.R Day -0.1502750 -1.8240128 0.0702234 144
Wind Temp -0.4579879 -6.3308351 0.0000000 151
Wind Month -0.1782926 -2.2265711 0.0274562 151
Wind Day 0.0271809 0.3341280 0.7387466 151
Temp Month 0.4209473 5.7025370 0.0000001 151
Temp Day -0.1305932 -1.6186176 0.1076164 151
Month Day -0.0079618 -0.0978389 0.9221900 151

2.4.3 Ok, but why?

comb is a table containing all combinations of the names of the columns. What we do is mapping a cor.test on all these combinations by extracting, each time, the column as a vector, with airquality[[.x]] and airquality[[.y]].

pmap allows to use a list as a signe input.

2.5 Linear regression

2.5.1 I want to…

Get the r.squared of each of the possible lm of airquality combinations.

2.5.2 Here’s how to:

##  [1] 1.213419e-01 3.618582e-01 4.877072e-01 2.706660e-02 1.749177e-04
##  [6] 3.225293e-03 7.608786e-02 5.670205e-03 2.258257e-02 2.097529e-01
## [11] 3.178824e-02 7.388015e-04 1.771966e-01 1.705458e-02 6.338966e-05

2.5.3 Ok, but why?

We’re building a model of all combinations with pmap, just as before with cor.test.

Then, the get_rsquared function is a composition of extracting the r.squared of the summary of a lm result. compose(x, y) allows to build x(y()). Here, we are combining a mapper extracting the r.squared element out of the summary() of a lm.

2.6 significant r.squared

2.6.1 I want to…

Know if some r.square are above O.5 :

2.6.3 Ok, but why?

some checks if any of the input validate the condition.

2.7 test and validation

2.7.1 I want to…

Create 20 test and validation datasets.

2.7.3 Ok, but why?

rerun runs the sampling 20 times. To obtain the 20 validation sets, we anti-join each elements of the train list with the original dataframe. That way, train[1] + validation[1] = titanic, train[2] + validation[2] = titanic, etc

2.8 rpart

2.8.1 I want to…

Create 20 rpart, modeled on my 20 elements in the test list.

2.8.2 Here’s how to:

## n= 1047 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1047 408 0 (0.6103152 0.3896848)  
##   2) sex=male 666 127 0 (0.8093093 0.1906907) *
##   3) sex=female 381 100 1 (0.2624672 0.7375328) *

2.8.3 Ok, but why?

partial allows to build a prefil function, which is then mapped on each element.

2.9 Make prediction

2.9.1 I want to…

Make prediction based on my models

2.9.3 Ok, but why?

map2 allows to map on two arguments.

2.10 Confusion matrix

2.10.1 I want to…

Create a conf matrix on all these results:

2.10.3 Ok, but why?

You can use .x as many times as you want in .f.

2.11 Sensitivity and Specificity

2.11.1 I want to…

Detect which models have a specificity above 0.7 and sensitivity above 0.85 (randomly chosen numbers).

2.11.3 Ok, but why?

We have created a function that returns the position of elements that validates a condition. sens is the vector containing the position with sensitivity above 0.85, spec the vector for specificity above 0.7.

Then, we pass to keep a vector of logical built with map_lgl. This vector tells if each elements of sens is or isn’t in spec.