In R - fastest way pairwise comparing character strings on similarity


Question


I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks?

Say I have the following data.frame:

df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), 
                      v1=c("Test_a", "Test_b", "Test_a", "Test_b", "Test_b"), 
                      v2=c("Test_c", "Test_c", "Test_d", "Test_d", "Test_d"))

I want to compare each pair of rows in df on their Jaro-Winkler similarity.

With some help from others (see this post), I've been able to construct this code:

#the stringdist package supplies the string distance functions used below
library(stringdist)

#columns to compare 
testCols <- c("names", "v1", "v2")

#compare all pairs of rows on the test columns
RowCompare <- function(x){
  comp <- NULL
  pairs <- t(combn(nrow(x), 2))   #all row-index pairs
  for(i in 1:nrow(pairs)){
    row_a <- pairs[i, 1]
    row_b <- pairs[i, 2]
    a_tests <- x[row_a, testCols]
    b_tests <- x[row_b, testCols]
    comp <- rbind(comp, c(row_a, row_b, TestsCompare(a_tests, b_tests)))
  }
  colnames(comp) <- c("row_a", "row_b", "names_j", "v1_j", "v2_j")
  return(comp)
}

#define TestsCompare: Jaro-Winkler distance for each test column
TestsCompare <- function(x, y){
  names_j <- stringdist(x$names, y$names, method = "jw")
  v1_j <- stringdist(x$v1, y$v1, method = "jw")
  v2_j <- stringdist(x$v2, y$v2, method = "jw")
  c(names_j, v1_j, v2_j)
}

This generates the correct output:

output = as.data.frame(RowCompare(df))

> output
   row_a row_b   names_j      v1_j      v2_j
1      1     2 0.4444444 0.1111111 0.0000000
2      1     3 0.3571429 0.0000000 0.1111111
3      1     4 0.4444444 0.1111111 0.1111111
4      1     5 0.4444444 0.1111111 0.1111111  
5      2     3 0.4603175 0.1111111 0.1111111
6      2     4 0.3333333 0.0000000 0.1111111
7      2     5 0.3333333 0.0000000 0.1111111
8      3     4 0.5634921 0.1111111 0.0000000
9      3     5 0.5634921 0.1111111 0.0000000
10     4     5 0.0000000 0.0000000 0.0000000

However, my real data.frame has 8 million observations, and I make 17 comparisons. Running this code takes days...

I am looking for ways to speed up this process:

  • Should I use matrices instead of data.frames?
  • How to parallelize this process?
  • Vectorize? (one possible sketch follows this list)
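
For illustration, one way the loop above could be vectorized: build all row-index pairs once with combn, then call stringdist on whole column vectors, since stringdist is vectorized over both of its arguments. A minimal sketch, assuming the df and testCols from above (the as.character calls are a guard in case the columns are factors):

library(stringdist)

#all row-index pairs, as in RowCompare
pairs <- t(combn(nrow(df), 2))
idx_a <- pairs[, 1]
idx_b <- pairs[, 2]

#one vectorized stringdist call per column, instead of one call per pair
output <- data.frame(
  row_a   = idx_a,
  row_b   = idx_b,
  names_j = stringdist(as.character(df$names[idx_a]), as.character(df$names[idx_b]), method = "jw"),
  v1_j    = stringdist(as.character(df$v1[idx_a]),    as.character(df$v1[idx_b]),    method = "jw"),
  v2_j    = stringdist(as.character(df$v2[idx_a]),    as.character(df$v2[idx_b]),    method = "jw")
)

This removes the per-pair rbind and function-call overhead, though the number of pairs itself still grows quadratically with the number of rows.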

Solution

If you iterate over the variables you want to check, you can make a distance matrix for each with stringdist::stringdistmatrix. Using a form of lapply or purrr::map will return a list of distance matrices (one for each column), which you can in turn iterate over to call broom::tidy, which will turn them into nicely formatted data.frames. If you use purrr::map_df and its .id parameter, the results will be coerced into one big data.frame, and the name of each list element will be added as a new column so you can keep them straight. The resulting data.frame will be in long form, so if you want it to match the results above, reshape with tidyr::spread.
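
For instance, a minimal sketch of that single-method pipeline, assuming the jw metric for every column (the mixed-method version appears in the final example below):

library(tidyverse)

#one distance matrix per column, then tidy and bind into one long data.frame
map(df, ~stringdist::stringdistmatrix(.x, method = 'jw')) %>% 
    map_df(broom::tidy, .id = 'var') %>% 
    spread(var, distance)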

If, as you mentioned in the comments, you want to use different methods for different variables, iterate in parallel with map2 or Map.

Altogether,

library(tidyverse)

#soundex for names, Jaro-Winkler for v1 and v2: one distance matrix per column
map2(df, c('soundex', 'jw', 'jw'), ~stringdist::stringdistmatrix(.x, method = .y)) %>% 
    map_df(broom::tidy, .id = 'var') %>%    #tidy each matrix; tag rows by column name
    spread(var, distance)                   #wide form: one column per variable

##    item1 item2 names        v1        v2
## 1      2     1     1 0.1111111 0.0000000
## 2      3     1     1 0.0000000 0.1111111
## 3      3     2     1 0.1111111 0.1111111
## 4      4     1     1 0.1111111 0.1111111
## 5      4     2     1 0.0000000 0.1111111
## 6      4     3     1 0.1111111 0.0000000
## 7      5     1     1 0.1111111 0.1111111
## 8      5     2     1 0.0000000 0.1111111
## 9      5     3     1 0.1111111 0.0000000
## 10     5     4     0 0.0000000 0.0000000

Note that while choose(5, 2) returns 10 observations, choose(8000000, 2) returns 3.2e+13 (32 trillion) observations, so for practical purposes, even though this will work much more quickly than your existing code (and stringdistmatrix does some parallelization when possible), the data will get prohibitively big unless you are only working on subsets.
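
The combinatorial growth is easy to check directly:

choose(5, 2)    # 10 pairwise comparisons for 5 rows
choose(8e6, 2)  # ~3.2e13: about 32 trillion comparisons for 8 million rows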
