Chapter 4 Text Mining
4.1 Build regex
4.1.1 I want to…
Automatically build a regex that joins several patterns with the or operator, |.
4.1.2 Here’s how to:
library(purrr)

# Join any number of strings into one regex of alternatives
regex_build <- function(...){
  reduce(list(...), ~ paste(.x, .y, sep = "|"))
}
regex_build("one","two","three", "four")
## [1] "one|two|three|four"
4.1.3 Ok, but why?
reduce turns a list into a single value by successively applying a binary function to its elements, working from left to right. In other words, reduce(list("one","two","three","four"), ~ paste(.x, .y, sep = "|")) does paste(paste(paste("one", "two", sep = "|"), "three", sep = "|"), "four", sep = "|").
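To make the fold visible, here is a minimal sketch that compares reduce with accumulate, which keeps every intermediate result of the same left-to-right fold. The helper name join_or is an assumption introduced for illustration only.

```r
library(purrr)

words <- list("one", "two", "three", "four")

# Hypothetical helper: the binary function handed to the fold
join_or <- function(x, y) paste(x, y, sep = "|")

# accumulate() keeps each intermediate value of the left fold
steps <- accumulate(words, join_or)

# reduce() returns only the final value
folded <- reduce(words, join_or)

# The same computation written out by hand, innermost call first
nested <- paste(paste(paste("one", "two", sep = "|"), "three", sep = "|"), "four", sep = "|")

# Base R reaches the same string with the collapse argument of paste()
collapsed <- paste(unlist(words), collapse = "|")

print(steps)
print(folded)
```

For plain string joining, paste with collapse is the shorter route; reduce becomes more useful when the combining function is something other than concatenation.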
4.1.4 See also
4.2 Regex extraction
4.2.1 I want to…
Extract the elements from a list of words that match a regex
4.2.2 Here’s how to:
# Random words from https://www.randomlists.com/random-words
words <- c("copper", "explain", "ill-fated", "truck", "neat","unite","branch","educated","tenuous", "hum","decisive","notice")
# Extract words that end with an "e"
my_regex <- "e$"
keep(words, ~ grepl(my_regex, .x))
## [1] "unite" "decisive" "notice"
4.2.3 Ok, but why?
keep retains only the elements for which the predicate returns TRUE. Here, the predicate uses grepl, which returns TRUE if the regex matches the string and FALSE otherwise.
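As a minimal sketch of what the predicate does, the example below runs grepl on a shortened word list (a subset chosen for illustration) and compares keep with its counterpart discard and with plain logical indexing.

```r
library(purrr)

# Shortened word list, for illustration only
words <- c("copper", "explain", "unite", "hum", "notice")
my_regex <- "e$"

# grepl() is vectorised: one TRUE/FALSE per word
matches <- grepl(my_regex, words)

# keep() retains the elements where the predicate is TRUE
kept <- keep(words, ~ grepl(my_regex, .x))

# discard() is the opposite: it drops those elements
dropped <- discard(words, ~ grepl(my_regex, .x))

# Base R logical indexing selects the same elements as keep()
indexed <- words[matches]

print(kept)
print(dropped)
```

Because grepl is already vectorised, logical indexing is enough for a character vector; keep is most helpful when the input is a list or the predicate is not vectorised.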
4.2.4 See also
4.3 Collapse a list of words
Given the following list of hashtags:
4.3.1 I want to…
Turn my list of words into a single character string.
4.3.2 Here’s how to:
## [1] "#RStats #Datascience #RStats #BigData"
4.3.3 Ok, but why?
simplify flattens a list of n vectors, each of length m, into a single vector of length n*m. We then paste all the elements together into one string, using the collapse argument of paste.
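The two steps described above can be sketched as follows. The hashtags list is an assumption, chosen only so that the result matches the output shown in this recipe, and base R's unlist stands in for the flattening step because purrr's simplify is deprecated in recent releases.

```r
# Assumed input: a list of two character vectors (hypothetical data)
hashtags <- list(c("#RStats", "#Datascience"), c("#RStats", "#BigData"))

# Step 1: flatten the list of 2 vectors of length 2 into one vector of length 4
flat <- unlist(hashtags)

# Step 2: paste the elements into a single string, separated by spaces
collapsed <- paste(flat, collapse = " ")

print(collapsed)
## [1] "#RStats #Datascience #RStats #BigData"
```

Swapping the space in collapse for another separator, such as ", ", changes how the words are joined without touching the flattening step.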