Chapter 4 Webmining

4.1 Status code

4.1.1 I want to…

Create a status code checker.

4.1.3 Ok, but why?

compose(y, x) builds a new function that calls x first and then passes its result to y, i.e. y(x(...)). So here, get_status("url") is equivalent to status_code(GET("url")).
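A minimal sketch of such a checker, assuming {purrr} and {httr} are installed. round_sqrt is an illustrative helper, not part of the recipe, that shows the same right-to-left composition order without network access.

```r
library(purrr)
library(httr)

# compose() applies its arguments from right to left by default:
# GET() runs first, then status_code() is applied to the response.
get_status <- compose(status_code, GET)

# Usage (needs network access):
# get_status("http://colinfay.me")

# Same ordering with base functions (illustrative helper):
# sqrt() runs first, then round().
round_sqrt <- compose(round, sqrt)
round_sqrt(10)  # 3
```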

4.1.4 See also


4.2 Check status code

4.2.1 I want to…

Check the HTTP status code for a list of pages.

4.2.3 Ok, but why?

200 is the status code that indicates the request succeeded. Here, every() checks whether all the status codes we just retrieved with GET are equal to 200.
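One way this check could be written, assuming {purrr} and {httr} are installed. The helper name all_ok and the hard-coded status codes are illustrative assumptions; only the helper needs network access.

```r
library(purrr)
library(httr)

# Hypothetical helper: TRUE only if every URL answers with status 200.
# Calling it requires network access.
all_ok <- function(urls) {
  codes <- map_int(urls, ~ status_code(GET(.x)))
  every(codes, ~ .x == 200)
}

# The same predicate on hard-coded, illustrative status codes:
all_fine <- every(c(200L, 200L), ~ .x == 200)  # TRUE
one_bad  <- every(c(200L, 404L), ~ .x == 200)  # FALSE
```

A call such as all_ok(c("http://colinfay.me", "http://thinkr.fr")) would return a single TRUE or FALSE for the whole list.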

4.2.4 See also

4.3 Scrape a list of URLs that may fail

4.3.1 I want to…

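Run read_html on a list of web pages, some of which may throw an error.

How does this differ from the function we saw previously? A request that reaches a server returns a response, even with an error status such as 403, whereas a request whose host cannot be resolved throws an R error: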

## Response [http://notexistingurl/]
##   Date: 2018-01-19 21:33
##   Status: 403
##   Content-Type: text/html; charset=iso-8859-1
##   Size: 202 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
## <html><head>
## <title>403 Forbidden</title>
## </head><body>
## <h1>Forbidden</h1>
## <p>You don't have permission to access /
## on this server.</p>
## </body></html>
## Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: notexistingurl.org

4.3.2 Here’s how to

## $`http://colinfay.me`
## {xml_document}
## <html lang="en" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="layout--home">\n\n    <!--[if lt IE 9]>\n<div class="no ...
## 
## $`http://thinkr.fr`
## {xml_document}
## <html class="no-js" lang="fr-FR">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="home page-template-default page page-id-35625 do-etfw f ...

4.3.3 Ok, but why?

possibly() wraps a function so that, instead of throwing an error, it returns the value given in its otherwise argument. Here, we chose to return NULL.

compact() then removes all the NULL (or otherwise empty) elements from the list.
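The pattern can be sketched offline, assuming only {purrr} is installed. The function risky is a hypothetical stand-in for read_html that fails on some inputs; it is not part of the recipe.

```r
library(purrr)

# Hypothetical stand-in for read_html(): errors on negative input.
risky <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# possibly() returns NULL instead of raising the error.
safe <- possibly(risky, otherwise = NULL)

res <- map(c(4, -1, 9), safe)  # second element is NULL
kept <- compact(res)           # NULL element dropped
length(kept)                   # 2

# For scraping, the same idea would be (needs {rvest} and network access):
# safe_read <- possibly(rvest::read_html, otherwise = NULL)
# compact(map(urls, safe_read))
```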

4.4 Getting h2

4.4.1 I want to…

Get the h2s from a list of urls.

4.4.2 Here’s how to

## $`http://colinfay.me`
## [1] "On the blog : "                                                                                     
## [2] "\n      \n        Combining the new {rtweet} and {tidytext}\n\n      \n    "                        
## [3] "\n      \n        [How to] Include a dancing banana in your R package documentation\n\n      \n    "
## [4] "\n      \n        Some random R benchmarks\n\n      \n    "                                         
## [5] "\n      \n        {attempt} is now on CRAN\n\n      \n    "                                         
## [6] "\n      \n        A Crazy Little Thing Called {purrr} - Part 6 : doing statistics\n\n      \n    "  
## 
## $`http://thinkr.fr`
##  [1] "Conseil, développement et formation au logiciel R"                              
##  [2] "Formez-vous au logiciel R !"                                                    
##  [3] "\r\n\t\tLe logo ThinkR créé avec la librairie {sf}\r\n\t"                       
##  [4] "\r\n\t\tR de jeu #2 : Happy new yeaR\r\n\t"                                     
##  [5] "\r\n\t\tDe retour de Budapest\r\n\t"                                            
##  [6] "\r\n\t\tR de jeu #1 : API, dataviz, et statistiques\r\n\t"                      
##  [7] "\r\n\t\tforcats, forcats, vous avez dit forcats ?\r\n\t"                        
##  [8] "\r\n\t\tText mining et topic modeling avec R\r\n\t"                             
##  [9] "\r\n\t\tÀ la découverte de {rvest}\r\n\t"                                       
## [10] "\r\n\t\tPremiers pas en Machine Learning avec R. Volume 4 : random forest\r\n\t"
## [11] "\r\n\t\tAu menu du jour : {R6} — Partie 2\r\n\t"

4.4.3 Ok, but why?

We compose an h2 extractor by combining read_html, html_nodes and html_text. We then map this extractor over a list of URLs and name the results with the set_names function.
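A possible sketch of this extractor, assuming {purrr} and {rvest} are installed. To stay runnable offline, it parses two small inline HTML strings instead of URLs; the strings and the names page_a and page_b are illustrative assumptions.

```r
library(purrr)
library(rvest)

# Right-to-left: read_html() parses, html_nodes() selects the h2 nodes,
# html_text() extracts their text.
get_h2 <- compose(
  html_text,
  function(page) html_nodes(page, "h2"),
  read_html
)

# Illustrative inline documents standing in for real URLs.
docs <- c(
  "<html><body><h2>First</h2><h2>Second</h2></body></html>",
  "<html><body><h2>Only one</h2></body></html>"
)

res <- set_names(map(docs, get_h2), c("page_a", "page_b"))
res$page_a  # "First" "Second"
```

With real pages, docs would be a vector of URLs and set_names(urls) would label each result with its address.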

4.4.4 See also

4.5 JSON with many levels

4.5.1 I want to…

Extract all the test1 results.

4.5.2 Here’s how to:

## $obs1
## [1] "17"
## 
## $obs2
## [1] "12"

4.5.3 Ok, but why?

What we called before is a shortcut for map(file, ~ pluck(.x, "id")). That shortcut works on the first level of the list. If you need to go deeper, you need to specify the pluck call explicitly.

Be careful: {rvest} also has a pluck that doesn’t behave exactly like the pluck from {purrr}.

## Error in FUN(X[[i]], ...): subscript out of bounds
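To illustrate the explicit pluck call, here is a small sketch on an in-memory nested list, assuming only {purrr} is installed. The structure of file is hypothetical (the field name results in particular is an assumption), chosen to mimic parsed JSON with several levels.

```r
library(purrr)

# Hypothetical nested list, shaped like parsed multi-level JSON.
file <- list(
  obs1 = list(id = "a", results = list(test1 = "17", test2 = "3")),
  obs2 = list(id = "b", results = list(test1 = "12", test2 = "8"))
)

# First level: the string shortcut and the explicit pluck are equivalent.
ids_short <- map(file, "id")
ids_pluck <- map(file, ~ purrr::pluck(.x, "id"))

# Deeper level: give pluck() the full path.
# purrr:: is spelled out to avoid picking up another package's pluck.
test1 <- map(file, ~ purrr::pluck(.x, "results", "test1"))
test1$obs1  # "17"
```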

4.5.4 See also


4.6 Several API calls

4.6.1 I want to…

Make a series of API calls

4.6.2 Here’s how to:

| nom | code | codeDepartement | codeRegion | codesPostaux | population | _score |
| --- | --- | --- | --- | --- | --- | --- |
| Rennes | 35238 | 35 | 53 | 35000, 35200, 35700 | 211373 | 1.0000000 |
| Rennes-le-Château | 11309 | 11 | 76 | 11190 | 58 | 0.7761594 |
| Rennes-les-Bains | 11310 | 11 | 76 | 11190 | 258 | 0.7583585 |
| Rennes-sur-Loue | 25488 | 25 | 27 | 25440 | 88 | 0.6743168 |
| Rennes-en-Grenouilles | 53189 | 53 | 52 | 53110 | 117 | 0.6239261 |
| Vannes | 56260 | 56 | 53 | 56000 | 53032 | 1.0000000 |
| Vannes-le-Châtel | 54548 | 54 | 44 | 54112 | 579 | 0.7510264 |
| Pouy-sur-Vannes | 10301 | 10 | 44 | 10290 | 145 | 0.6873107 |
| Saulxures-lès-Vannes | 54496 | 54 | 44 | 54170 | 363 | 0.6820142 |
| Vannes-sur-Cosson | 45331 | 45 | 24 | 45510 | 589 | 0.6652302 |
| Brest | 29019 | 29 | 53 | 29200 | 139386 | 0.7182376 |
| Brestot | 27110 | 27 | 28 | 27350 | 518 | 0.6957980 |
| Esboz-Brest | 70216 | 70 | 27 | 70300 | 485 | 0.4918154 |

4.6.3 Ok, but why?

Here, we are calling an API that returns a JSON object, which {jsonlite} can easily turn into a data frame. So we use map_df to bind the results of the three calls into a single data.frame.
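The mechanics can be sketched without network access, assuming {purrr}, {jsonlite} and {dplyr} (which map_df relies on for row binding) are installed. fake_api is a hypothetical stand-in for the real API call; it returns JSON text with only two of the fields shown in the table.

```r
library(purrr)
library(jsonlite)

# Hypothetical stand-in for an API call: returns JSON text for a city name.
fake_api <- function(city) {
  switch(
    city,
    Rennes = '[{"nom": "Rennes", "population": 211373}]',
    Vannes = '[{"nom": "Vannes", "population": 53032}]',
    Brest  = '[{"nom": "Brest", "population": 139386}]'
  )
}

# fromJSON() turns each JSON array of objects into a data frame;
# map_df() binds the three data frames by row.
cities <- c("Rennes", "Vannes", "Brest")
res <- map_df(cities, ~ fromJSON(fake_api(.x)))
nrow(res)  # 3
```

With a real API, fake_api(.x) would be replaced by the request URL built from .x, since fromJSON can also read directly from a URL.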

4.6.4 See also