Chapter 4 Webmining
4.1 Status code
4.1.1 I want to…
Create a status code checker.
4.1.2 Here’s how to
## [1] 200
4.1.3 Ok, but why?
compose(y, x) composes a function that will do y(x()). So here, get_status("url") will do status_code(GET("url")).
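To see the order of application without touching the network, here is a minimal sketch. It assumes two hypothetical stand-ins, fake_get and fake_status_code, that play the roles of GET and status_code from {httr}; only compose comes from {purrr}.

```r
library(purrr)

# Hypothetical stand-ins (not part of {httr}): a fake request function
# that returns a list, and an extractor that reads the status from it.
fake_get <- function(url) list(url = url, status_code = 200L)
fake_status_code <- function(response) response$status_code

# compose(y, x) applies x first, then y, i.e. y(x(...))
get_status_demo <- compose(fake_status_code, fake_get)

demo_status <- get_status_demo("http://example.com")
manual_status <- fake_status_code(fake_get("http://example.com"))

# Both routes give the same value
identical(demo_status, manual_status)
```

The composed function and the hand-written nested call are interchangeable, which is what makes compose convenient for building small pipelines that are later passed to map.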
4.1.4 See also
4.2 Check status code
4.2.1 I want to…
Check the HTTP status code for a list of pages.
4.2.2 Here’s how to
urls <- c("http://colinfay.me", "http://thinkr.fr", "reallynotanadress")
get_status <- compose(status_code, GET)
map(urls, get_status) %>% every(~ .x == 200)
## [1] FALSE
4.2.3 Ok, but why?
200 is the status code that indicates the request succeeded. Here, the every function checks whether all the status codes we just retrieved with GET are equal to 200.
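The behaviour of every can be checked offline. The sketch below uses made-up status codes in place of the result of map(urls, get_status), and adds some (also from {purrr}) for contrast.

```r
library(purrr)

# Made-up status codes, standing in for the result of map(urls, get_status)
codes <- list(200L, 200L, 404L)

# every() is TRUE only if the predicate holds for all elements
all_ok <- every(codes, ~ .x == 200)

# some() is TRUE as soon as the predicate holds for one element
any_ok <- some(codes, ~ .x == 200)
```

With one 404 in the list, all_ok is FALSE while any_ok is TRUE.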
4.2.4 See also
4.3 Scrape a list of urls which may fail
4.3.1 I want to…
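Run read_html on a list of web pages, some of which may throw an error.
How does this differ from the function we saw previously? As the two outputs below show, a request can still come back as a response object (here with status 403), whereas an unresolvable host makes the call stop with an error: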
## Response [http://notexistingurl/]
## Date: 2018-01-19 21:33
## Status: 403
## Content-Type: text/html; charset=iso-8859-1
## Size: 202 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
## <html><head>
## <title>403 Forbidden</title>
## </head><body>
## <h1>Forbidden</h1>
## <p>You don't have permission to access /
## on this server.</p>
## </body></html>
## Error in curl::curl_fetch_memory(url, handle = handle): Could not resolve host: notexistingurl.org
4.3.2 Here’s how to
library(rvest)
urls <- c("http://colinfay.me", "http://thinkr.fr", "reallynotanadress")
possible_read <- possibly(read_html, otherwise = NULL)
map(urls, possible_read) %>% set_names(urls) %>% compact()
## $`http://colinfay.me`
## {xml_document}
## <html lang="en" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="layout--home">\n\n <!--[if lt IE 9]>\n<div class="no ...
##
## $`http://thinkr.fr`
## {xml_document}
## <html class="no-js" lang="fr-FR">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="home page-template-default page page-id-35625 do-etfw f ...
4.3.3 Ok, but why?
possibly turns a function into another function that returns the value given in otherwise in case of failure. Here, we chose to return NULL.
compact() removes all the NULL elements from a list.
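The same pattern can be tried without any web access. In this minimal sketch, log (which errors on character input) stands in for a read_html call that fails on a bad address; the input values are made up.

```r
library(purrr)

# log() errors on character input, so it plays the role of a
# read_html() call that fails on a bad URL.
possible_log <- possibly(log, otherwise = NULL)

inputs <- list(a = 1, b = "not a number", c = 100)

# Element "b" becomes NULL instead of stopping the whole map()
results <- map(inputs, possible_log)

# compact() drops the NULL element, keeping "a" and "c"
kept <- compact(results)
names(kept)
```

Because the failure is replaced by NULL rather than raised, the remaining elements are still processed, and compact leaves only the successful results.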
4.4 Getting h2
4.4.1 I want to…
Get the h2s from a list of urls.
4.4.2 Here’s how to
get_h2 <- compose(html_text,
                  as_mapper(~ html_nodes(.x, "h2")),
                  read_html)
urls <- c("http://colinfay.me", "http://thinkr.fr")
map(urls, get_h2) %>% set_names(urls)
## $`http://colinfay.me`
## [1] "On the blog : "
## [2] "\n \n Combining the new {rtweet} and {tidytext}\n\n \n "
## [3] "\n \n [How to] Include a dancing banana in your R package documentation\n\n \n "
## [4] "\n \n Some random R benchmarks\n\n \n "
## [5] "\n \n {attempt} is now on CRAN\n\n \n "
## [6] "\n \n A Crazy Little Thing Called {purrr} - Part 6 : doing statistics\n\n \n "
##
## $`http://thinkr.fr`
## [1] "Conseil, développement et formation au logiciel R"
## [2] "Formez-vous au logiciel R !"
## [3] "\r\n\t\tLe logo ThinkR créé avec la librairie {sf}\r\n\t"
## [4] "\r\n\t\tR de jeu #2 : Happy new yeaR\r\n\t"
## [5] "\r\n\t\tDe retour de Budapest\r\n\t"
## [6] "\r\n\t\tR de jeu #1 : API, dataviz, et statistiques\r\n\t"
## [7] "\r\n\t\tforcats, forcats, vous avez dit forcats ?\r\n\t"
## [8] "\r\n\t\tText mining et topic modeling avec R\r\n\t"
## [9] "\r\n\t\tÀ la découverte de {rvest}\r\n\t"
## [10] "\r\n\t\tPremiers pas en Machine Learning avec R. Volume 4 : random forest\r\n\t"
## [11] "\r\n\t\tAu menu du jour : {R6} — Partie 2\r\n\t"
4.4.3 Ok, but why?
We are composing an h2 extractor by combining read_html, html_nodes and html_text. We then map this extractor over a list of urls, before naming the results with the set_names function.
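The extractor can also be exercised offline, since read_html accepts a literal HTML string. The sketch below uses two made-up HTML snippets instead of real pages, and adds trimws as a final step to strip the surrounding whitespace seen in the output above; that extra step is an addition for this sketch, not part of the recipe.

```r
library(purrr)
library(rvest)

# Same composition as in the recipe, with trimws() added as the last
# step applied (an addition for this sketch).
get_h2_clean <- compose(
  trimws,
  html_text,
  as_mapper(~ html_nodes(.x, "h2")),
  read_html
)

# Made-up HTML snippets standing in for real pages
pages <- c(
  page_a = "<html><body><h2> First title </h2><p>text</p><h2>Second title</h2></body></html>",
  page_b = "<html><body><h1>Main</h1><h2>\n Only title \n</h2></body></html>"
)

# map() keeps the names of the input vector, so no set_names() is needed here
h2s <- map(pages, get_h2_clean)
```

Functions are listed from last applied to first applied, so read_html runs first and trimws runs last.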
4.4.4 See also
4.5 JSON with many levels
4.5.1 I want to…
Extract all the test1 results.
4.5.2 Here’s how to:
## $obs1
## [1] "17"
##
## $obs2
## [1] "12"
4.5.3 Ok, but why?
What we called before is a shortcut for map(file, ~ pluck(.x, "id")). That shortcut works on the first level of the list. If you need to go deeper, you need to explicitly specify the pluck call.
Be careful: there is also a pluck in {rvest} that doesn’t behave exactly like the pluck from {purrr}.
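As a minimal sketch of the two forms, assume a nested list shaped like parsed JSON. The structure and the id values below are hypothetical; only the test1 values mirror the output shown above.

```r
library(purrr)

# Hypothetical nested list, shaped like parsed JSON with several levels
file <- list(
  obs1 = list(id = "a1", results = list(test1 = "17", test2 = "3")),
  obs2 = list(id = "b2", results = list(test1 = "12", test2 = "8"))
)

# First level: the string shortcut and the explicit pluck() call agree
ids_short <- map(file, "id")
ids_pluck <- map(file, ~ pluck(.x, "id"))

# Deeper level: give pluck() the full path down to the value
test1 <- map(file, ~ pluck(.x, "results", "test1"))
```

Writing purrr::pluck instead of pluck removes any ambiguity when another attached package exports a function with the same name.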
## Error in FUN(X[[i]], ...): subscript out of bounds
4.5.4 See also
4.6 Several API calls
4.6.1 I want to…
Make a series of API calls
4.6.2 Here’s how to:
library(attempt)
library(curl)
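caller <- function(x){
  # verify the internet connection before calling the API
  stop_if_not(has_internet(), msg = "You should have internet to do that")
  # query the API for communes whose name matches x
  res <- GET(url = "https://geo.api.gouv.fr/communes", query = list(nom = x))
  # convert the raw response body to text, then parse the JSON into a data frame
  res$content %>% rawToChar() %>% jsonlite::fromJSON(simplifyDataFrame = TRUE)
}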
city <- c("Rennes","Vannes","Brest")
map_df(city, caller)
nom | code | codeDepartement | codeRegion | codesPostaux | population | _score |
---|---|---|---|---|---|---|
Rennes | 35238 | 35 | 53 | 35000, 35200, 35700 | 211373 | 1.0000000 |
Rennes-le-Château | 11309 | 11 | 76 | 11190 | 58 | 0.7761594 |
Rennes-les-Bains | 11310 | 11 | 76 | 11190 | 258 | 0.7583585 |
Rennes-sur-Loue | 25488 | 25 | 27 | 25440 | 88 | 0.6743168 |
Rennes-en-Grenouilles | 53189 | 53 | 52 | 53110 | 117 | 0.6239261 |
Vannes | 56260 | 56 | 53 | 56000 | 53032 | 1.0000000 |
Vannes-le-Châtel | 54548 | 54 | 44 | 54112 | 579 | 0.7510264 |
Pouy-sur-Vannes | 10301 | 10 | 44 | 10290 | 145 | 0.6873107 |
Saulxures-lès-Vannes | 54496 | 54 | 44 | 54170 | 363 | 0.6820142 |
Vannes-sur-Cosson | 45331 | 45 | 24 | 45510 | 589 | 0.6652302 |
Brest | 29019 | 29 | 53 | 29200 | 139386 | 0.7182376 |
Brestot | 27110 | 27 | 28 | 27350 | 518 | 0.6957980 |
Esboz-Brest | 70216 | 70 | 27 | 70300 | 485 | 0.4918154 |
4.6.3 Ok, but why?
Here, we are calling an API which returns a JSON object that can easily be turned into a data frame with {jsonlite}. So we choose to use map_df to bind the three results, by row, into a single data.frame.
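The row-binding step can be tried without calling the API. In this minimal sketch, fake_caller is a hypothetical offline replacement for caller that parses canned JSON strings (with a few values copied from the table above) instead of a live response. Note that map_df relies on {dplyr} being installed.

```r
library(purrr)

# Canned JSON responses standing in for the API
# (names and codes copied from the table above)
canned <- list(
  Rennes = '[{"nom": "Rennes", "code": "35238"}, {"nom": "Rennes-les-Bains", "code": "11310"}]',
  Vannes = '[{"nom": "Vannes", "code": "56260"}]'
)

# Hypothetical offline replacement for caller():
# parse the canned JSON for one city into a data frame
fake_caller <- function(x) {
  jsonlite::fromJSON(canned[[x]], simplifyDataFrame = TRUE)
}

# One data frame per city, bound by row into a single data frame
communes <- map_df(c("Rennes", "Vannes"), fake_caller)
```

Each call returns a data frame with the same columns, which is what allows the results to be stacked into one table.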