Chapter 4 Text Mining

4.1 Build regex

4.1.1 I want to…

Automatically build a regex that joins several patterns with the alternation operator |

4.1.2 Here’s how to:

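# purrr provides reduce(), used here, as well as keep() and simplify() used later in this chapter
library(purrr)
# Join any number of patterns with the regex alternation operator "|"
regex_build <- function(...){
    reduce(list(...), ~ paste(.x, .y, sep = "|"))
}
regex_build("one", "two", "three", "four")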

## [1] "one|two|three|four"

4.1.3 Ok, but why?

reduce turns a list into a single value by repeatedly applying a binary function, working from the left: the result of each call becomes the first argument (.x) of the next call. In other words, reduce(list("one","two","three"), ~ paste(.x, .y, sep = "|")) does paste(paste("one", "two", sep = "|"), "three", sep = "|").
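
To make the folding order concrete, here is a minimal sketch that writes out by hand the nested calls reduce makes for three words and compares them with what reduce returns. The last two lines are an addition, not part of the recipe: for a plain character vector, base R's paste with collapse should give the same string without purrr.

```r
library(purrr)

words <- list("one", "two", "three")

# What reduce() does: fold from the left, two elements at a time
by_reduce <- reduce(words, ~ paste(.x, .y, sep = "|"))
by_hand   <- paste(paste("one", "two", sep = "|"), "three", sep = "|")
identical(by_reduce, by_hand)
## [1] TRUE

# Base R alternative (not part of the recipe above)
by_collapse <- paste(unlist(words), collapse = "|")
identical(by_reduce, by_collapse)
## [1] TRUE
```

The reduce version is mostly useful as an illustration of folding; when the inputs are already a character vector, collapse is the shorter route.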

4.1.4 See also

4.2 Regex extraction

4.2.1 I want to…

Extract the elements from a list of words that match a regex

4.2.2 Here’s how to:

# Random words from https://www.randomlists.com/random-words
words <- c("copper", "explain", "ill-fated", "truck", "neat","unite","branch","educated","tenuous", "hum","decisive","notice")
# Extract words that end with an "e"
my_regex <- "e$"
keep(words, ~ grepl(my_regex, .x))

## [1] "unite"    "decisive" "notice"

4.2.3 Ok, but why?

keep retains only the elements for which the predicate returns TRUE. Here, the predicate is grepl, which returns TRUE if the regex matches the string and FALSE otherwise.
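
The sketch below shows the predicate at work on a shorter word list (a subset chosen for illustration). It also shows discard, purrr's complement of keep, and the base R equivalent using logical indexing, which works because grepl is vectorised. Neither of the last two appears in the recipe above.

```r
library(purrr)

words <- c("copper", "explain", "unite", "hum", "notice")

# keep() retains the elements for which the predicate is TRUE
kept <- keep(words, ~ grepl("e$", .x))
kept
## [1] "unite"  "notice"

# discard() is the complement: it drops those same elements
dropped <- discard(words, ~ grepl("e$", .x))
dropped
## [1] "copper"  "explain" "hum"

# grepl() is vectorised, so logical indexing gives the same result as keep()
indexed <- words[grepl("e$", words)]
identical(kept, indexed)
## [1] TRUE
```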

4.2.4 See also

4.3 Collapse a list of words

Given the following list of hashtags:

hash <- list(tweet1 = c("#RStats", "#Datascience"), tweet2 = c("#RStats", "#BigData"))

4.3.1 I want to…

Collapse my list of words into a single string.

4.3.2 Here’s how to:

simplify(hash) %>% paste(collapse = " ")

## [1] "#RStats #Datascience #RStats #BigData"

4.3.3 Ok, but why?

simplify turns a list of n vectors, each of length m, into a single vector of length n*m. We then paste all the elements together into one string, using the collapse argument of paste.
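
The two steps can be inspected separately. The sketch below swaps simplify for base R's unlist, partly because purrr 1.0.0 and later mark simplify as deprecated; this substitution is an assumption of the example, not the recipe's method. Here n = 2 tweets with m = 2 hashtags each, so the flattened vector has length 4.

```r
hash <- list(tweet1 = c("#RStats", "#Datascience"),
             tweet2 = c("#RStats", "#BigData"))

# Step 1: flatten the list into one character vector (names dropped)
flat <- unlist(hash, use.names = FALSE)
length(flat)
## [1] 4

# Step 2: collapse = " " joins the elements into a single string
one_string <- paste(flat, collapse = " ")
one_string
## [1] "#RStats #Datascience #RStats #BigData"
```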

purrr cookbook