Getting started
{tidystringdist} is a package that extends the {stringdist} package with tidy data principles.
The idea is to compute string distances and combine the results with the data manipulation and visualisation functions of the tidyverse.
You can install the latest stable version from CRAN with:
Or the development version from GitHub:
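Both installation routes can be sketched as follows. This is a hedged example rather than the project's official instructions: `install.packages()` pulls the released package from the configured repository, and the GitHub repository path `ColinFay/tidystringdist` is an assumption that should be checked against the project page.

```r
# Stable release from the configured repository (CRAN by default)
install.packages("tidystringdist")

# Development version; the repository path is an assumption, verify it first
# install.packages("remotes")
remotes::install_github("ColinFay/tidystringdist")
```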
tidy_comb()
The tidy_comb() and tidy_comb_all() functions return all possible pairs of strings. The input can be a single vector, a data.frame and one of its columns, or two vectors:
library(tidystringdist)
tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 A B
#> 2 A C
#> 3 B C
tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 setosa versicolor
#> 2 setosa virginica
#> 3 versicolor virginica
tidy_comb("Paris", state.name)
#> # A tibble: 50 x 2
#> V1 V2
#> * <chr> <chr>
#> 1 Alabama Paris
#> 2 Alaska Paris
#> 3 Arizona Paris
#> 4 Arkansas Paris
#> 5 California Paris
#> 6 Colorado Paris
#> 7 Connecticut Paris
#> 8 Delaware Paris
#> 9 Florida Paris
#> 10 Georgia Paris
#> # ... with 40 more rows
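To make the "all possible combinations" idea concrete, here is a minimal base R sketch that enumerates every unordered pair of distinct values. The helper `all_pairs()` is hypothetical and is not the package's implementation; it only mimics the shape of the output shown above (two character columns named V1 and V2).

```r
# Hypothetical helper: a base R sketch of the "all pairs" idea,
# not the code used inside tidystringdist.
all_pairs <- function(x) {
  x <- unique(as.character(x))      # one entry per distinct string
  pairs <- t(utils::combn(x, 2))    # each row is one unordered pair
  data.frame(V1 = pairs[, 1], V2 = pairs[, 2], stringsAsFactors = FALSE)
}

all_pairs(LETTERS[1:3])        # pairs A-B, A-C, B-C
nrow(all_pairs(state.name))    # choose(50, 2) = 1225 pairs
```

The number of rows grows quadratically with the number of distinct strings, so it is worth deduplicating the input before building the pairs.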
Once you have this data.frame, you can use tidy_stringdist() to compute string distances. This function takes a data.frame, the two columns containing the strings, and one or more {stringdist} methods.
comb <- tidy_comb_all(state.name)
tidy_stringdist(comb)
#> # A tibble: 1,225 x 12
#> V1 V2 osa lv dl hamming lcs qgram cosine jaccard jw
#> * <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Alab… Alas… 3 3 3 Inf 5 5 0.216 0.571 0.254
#> 2 Alab… Ariz… 5 5 5 5 10 10 0.581 0.8 0.476
#> 3 Alab… Arka… 6 6 6 Inf 9 9 0.440 0.778 0.399
#> 4 Alab… Cali… 8 8 8 Inf 13 11 0.481 0.818 0.535
#> 5 Alab… Colo… 6 6 6 Inf 11 11 0.704 0.778 0.488
#> 6 Alab… Conn… 11 11 11 Inf 18 18 1 1 1
#> 7 Alab… Dela… 5 5 5 Inf 9 9 0.440 0.778 0.399
#> 8 Alab… Flor… 5 5 5 5 10 10 0.581 0.8 0.476
#> 9 Alab… Geor… 6 6 6 6 12 12 0.686 0.909 0.571
#> 10 Alab… Hawa… 5 5 5 Inf 9 9 0.474 0.875 0.460
#> # ... with 1,215 more rows, and 1 more variable: soundex <dbl>
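The Inf entries in the hamming column are expected: the Hamming distance is only defined for strings of equal length. The sketch below calls `stringdist::stringdist()` directly on two pairs from the table to show this. It assumes that each column of the table corresponds to one {stringdist} method applied row by row, which is how the package describes itself but is not verified here.

```r
library(stringdist)

# First pair in the table: 7 characters vs 6 characters
osa_dist  <- stringdist("Alabama", "Alaska", method = "osa")      # 3, as in row 1
ham_inf   <- stringdist("Alabama", "Alaska", method = "hamming")  # Inf: lengths differ

# Second pair in the table: both strings have 7 characters
ham_finite <- stringdist("Alabama", "Arizona", method = "hamming") # 5, as in row 2
```

If the Inf values get in the way of later summaries, they can be filtered out or the hamming method can simply be left out of the call.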
By default, all methods are computed. To compute only specific methods, use the method argument:
comb <- tidy_comb_all(state.name)
tidy_stringdist(comb, method = c("osa","jw"))
#> # A tibble: 1,225 x 4
#> V1 V2 osa jw
#> * <chr> <chr> <dbl> <dbl>
#> 1 Alabama Alaska 3 0.254
#> 2 Alabama Arizona 5 0.476
#> 3 Alabama Arkansas 6 0.399
#> 4 Alabama California 8 0.535
#> 5 Alabama Colorado 6 0.488
#> 6 Alabama Connecticut 11 1
#> 7 Alabama Delaware 5 0.399
#> 8 Alabama Florida 5 0.476
#> 9 Alabama Georgia 6 0.571
#> 10 Alabama Hawaii 5 0.460
#> # ... with 1,215 more rows
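Because the result is a tibble, it can be passed straight to tidyverse verbs. The following is a minimal sketch, assuming {dplyr} is installed; the cutoff of 2 edits and the name `closest` are arbitrary choices made for illustration.

```r
library(tidystringdist)
library(dplyr)

# Keep only pairs of state names within 2 edits of each other,
# then order them by Jaro-Winkler distance (smallest first).
closest <- tidy_comb_all(state.name) %>%
  tidy_stringdist(method = c("osa", "jw")) %>%
  filter(osa <= 2) %>%
  arrange(jw)

closest
```

Restricting the call to the methods that are actually needed also avoids computing the other distances for all 1,225 pairs.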