Compute string distances the tidy way. Built on top of the stringdist package.
First, create a tibble containing the pairs of words you want to compare. You can do this with the tidy_comb and tidy_comb_all functions. The first takes a base word and pairs it with each element of a list or of a data.frame column; the second builds every possible pair from a list or a column.
If you already have a data.frame with two columns containing the strings to compare, you can skip this part.
library(tidystringdist)
tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 A B
#> 2 A C
#> 3 B C
tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 setosa versicolor
#> 2 setosa virginica
#> 3 versicolor virginica
tidy_comb("Paris", state.name[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 Alabama Paris
#> 2 Alaska Paris
#> 3 Arizona Paris
Once you’ve got this data.frame, you can use tidy_stringdist() to compute string distances. This function takes a data.frame, the two columns containing the strings, and one or more stringdist methods.
Note that if you’ve used tidy_comb or tidy_comb_all to create your data.frame, you won’t need to set the column names.
library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw)
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, q =
#> q, : Non-printable ascii or non-ascii characters in soundex. Results may be
#> unreliable. See ?printable_ascii.
#> # A tibble: 3,741 x 12
#> V1 V2 osa lv dl hamming lcs qgram cosine jaccard jw
#> * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Luke… C-3PO 14 14 14 Inf 19 19 1 1 1
#> 2 Luke… R2-D2 14 14 14 Inf 19 19 1 1 1
#> 3 Luke… Dart… 11 11 11 Inf 17 17 0.615 0.75 0.575
#> 4 Luke… Leia… 11 11 11 Inf 17 15 0.586 0.667 0.534
#> 5 Luke… Owen… 12 12 12 Inf 15 11 0.503 0.571 0.462
#> 6 Luke… Beru… 16 16 16 Inf 22 18 0.517 0.667 0.466
#> 7 Luke… R5-D4 14 14 14 Inf 19 19 1 1 1
#> 8 Luke… Bigg… 13 13 13 Inf 21 19 0.590 0.667 0.573
#> 9 Luke… Obi-… 14 14 14 14 24 22 0.809 0.842 0.635
#> 10 Luke… Anak… 5 5 5 Inf 8 8 0.206 0.357 0.282
#> # ... with 3,731 more rows, and 1 more variable: soundex <dbl>
By default, the call computes every method. You can restrict it to specific methods with the method argument:
tidy_stringdist(tidy_comb_sw, method = c("osa","jw"))
#> # A tibble: 3,741 x 4
#> V1 V2 osa jw
#> * <chr> <chr> <dbl> <dbl>
#> 1 Luke Skywalker C-3PO 14 1
#> 2 Luke Skywalker R2-D2 14 1
#> 3 Luke Skywalker Darth Vader 11 0.575
#> 4 Luke Skywalker Leia Organa 11 0.534
#> 5 Luke Skywalker Owen Lars 12 0.462
#> 6 Luke Skywalker Beru Whitesun lars 16 0.466
#> 7 Luke Skywalker R5-D4 14 1
#> 8 Luke Skywalker Biggs Darklighter 13 0.573
#> 9 Luke Skywalker Obi-Wan Kenobi 14 0.635
#> 10 Luke Skywalker Anakin Skywalker 5 0.282
#> # ... with 3,731 more rows
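If your strings already sit in a two-column data.frame of your own, the column names have to be passed explicitly. The sketch below assumes that tidy_stringdist() accepts unquoted column names through arguments named v1 and v2 (check ?tidy_stringdist to confirm). The data and the column names typed and reference are made up for illustration.

```r
library(tidystringdist)

# Hypothetical data: `typed` and `reference` are illustrative column names
pairs <- data.frame(
  typed = c("kitten", "flaw", "paris"),
  reference = c("sitting", "lawn", "Paris"),
  stringsAsFactors = FALSE
)

# Assumption: v1 and v2 take the unquoted names of the two string columns
res <- tidy_stringdist(pairs, v1 = typed, v2 = reference, method = c("osa", "jw"))
res
```

The result should keep the two original columns and add one column per requested method, just as in the V1/V2 case.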
The goal is to provide a convenient interface to work with other tools from the tidyverse.
tidy_stringdist(tidy_comb_sw, method = "osa") %>%
filter(osa > 20) %>%
arrange(desc(osa))
#> # A tibble: 11 x 3
#> V1 V2 osa
#> <chr> <chr> <dbl>
#> 1 C-3PO Jabba Desilijic Tiure 21
#> 2 C-3PO Wicket Systri Warrick 21
#> 3 R2-D2 Wicket Systri Warrick 21
#> 4 R5-D4 Wicket Systri Warrick 21
#> 5 Jabba Desilijic Tiure IG-88 21
#> 6 Jabba Desilijic Tiure Cordé 21
#> 7 Jabba Desilijic Tiure R4-P17 21
#> 8 Jabba Desilijic Tiure BB8 21
#> 9 IG-88 Wicket Systri Warrick 21
#> 10 Wicket Systri Warrick R4-P17 21
#> 11 Wicket Systri Warrick BB8 21
starwars %>%
filter(species == "Droid") %>%
tidy_comb_all(name) %>%
tidy_stringdist() %>%
summarise_if(is.numeric, mean)
#> # A tibble: 1 x 10
#> osa lv dl hamming lcs qgram cosine jaccard jw soundex
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4.4 4.4 4.4 Inf 7.4 7.4 0.830 0.867 0.642 0.9
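The Inf in the hamming column comes from stringdist itself: the Hamming distance is only defined for strings of equal length, and stringdist returns Inf otherwise, so a single pair of unequal length makes the mean infinite. A minimal sketch calling the underlying stringdist package directly (the second comparison string is made up for illustration):

```r
library(stringdist)

# Equal lengths: Hamming counts the positions where the characters differ
d_equal <- stringdist("C-3PO", "R2-D2", method = "hamming")

# Unequal lengths: stringdist returns Inf for the Hamming method
d_unequal <- stringdist("C-3PO", "BB8", method = "hamming")

# One way to summarise while ignoring infinite values
d <- c(d_equal, d_unequal)
mean_finite <- mean(d[is.finite(d)])
```

Whether dropping the infinite values is appropriate depends on your analysis; choosing a method such as osa or lv, which handles unequal lengths, avoids the issue.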
Questions and feedback are welcome!