Finding similar rows (not duplicates) in a dataframe in R

Question

I have a dataset of more than 800k rows (example):

id     fieldA       fieldB              codeA   codeB
120    Similar one  addrs example1      929292  0006
3490   Similar oh   addrs example3      929292  0006
2012   CLOSE CAA    addrs example10232  kkda9a  0039
9058   CLASE CAC    addrs example01232  kkda9a  0039
9058   NON DONE     addrs example010193 kkda9a  0039
48848  OOO AD ADDD  addrs example18238  uyMMnn  8303

The id field is a unique ID. The codeA and codeB fields must match exactly, but fieldA and fieldB need a Levenshtein distance or a similar function. I need to find which rows are very similar on that basis. The output could be something along the lines of:

   codeA    codeB Similar
   929292   0006  120;3490
   kkda9a   0039  2012;9058
   kkda9a   0039  9058
   uyMMnn   8303  48848

A distance matrix for a dataset this big wouldn't work, and wouldn't make much sense given that I have two constraints like codeA and codeB. I'm guessing one approach would be a plyr function to split by codeA-codeB, but I'm stuck after that.

For clarification, I want to group together all rows that have high similarity in both fieldA and fieldB and an exact match in codeA and codeB.

Following David DeWert's idea, something along these lines seems to work for each codeA-codeB group. The output isn't nice, but it seems like a step in the right direction:

library(stringdist)

# Cluster the rows of one codeA-codeB group by the similarity of fieldA + fieldB
clustering <- function(x) {
  if (nrow(x) > 1) {
    keys <- paste(x$fieldA, x$fieldB)
    d <- stringdistmatrix(keys, keys, method = "qgram")
    rownames(d) <- x$id
    hc <- hclust(as.dist(d))
    # I still need to evaluate this cut height properly
    res <- cutree(hc, h = 5)
    # Named vector: the names are the ids, the values are the cluster numbers
    return(res)
  } else {
    # A single row forms its own cluster
    res <- 1
    names(res) <- x$id
    return(res)
  }
}
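The cut height h = 5 above is in raw q-gram distance units, which makes it hard to pick a sensible value. One possible way to make the threshold easier to reason about is to scale an edit distance by the length of the longer string, so that it falls between 0 and 1. The sketch below is only an illustration under assumptions: it swaps in base R's adist (plain Levenshtein) for the q-gram distance, and cluster_by_similarity and max_dissimilarity are hypothetical names, not part of the original code.

```r
# Hypothetical helper: cluster strings on a length-normalised Levenshtein distance.
# max_dissimilarity is an assumed threshold in [0, 1], not a tuned value.
cluster_by_similarity <- function(keys, ids, max_dissimilarity = 0.3) {
  # A single string forms its own cluster
  if (length(keys) < 2) return(setNames(1L, ids))
  d <- adist(keys)                                  # base R Levenshtein distance matrix
  longest <- outer(nchar(keys), nchar(keys), pmax)  # pairwise length of the longer string
  d <- d / pmax(longest, 1)                         # scale to [0, 1]; guard against empty strings
  hc <- hclust(as.dist(d), method = "complete")
  # Names are the ids, values are the cluster numbers
  setNames(cutree(hc, h = max_dissimilarity), ids)
}

# Usage on the first codeA-codeB group of the example data
keys <- c("Similar one addrs example1", "Similar oh addrs example3")
cluster_by_similarity(keys, c(120, 3490))
```

With complete linkage, cutting at 0.3 means that no two members of a cluster differ by more than about 30% of the longer string, which may be easier to justify than an absolute height.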

Now I need to find a way to split the dataframe into codeA-codeB groups and apply this function to each of them.

I managed a "good enough" approach for this using the clustering function above and the plyr package.

library(plyr)
result <- dlply(testDF, .(codeA, codeB), clustering)

This creates a list with one element per "group by codeA, codeB", like:

$`929292.0006`
 120 3490 
   1    1 

$kkda9a.0039
2012 9058 9058 
   1    1    2 

$uyMMnn.8303
48848 
    1 

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
   codeA codeB
1 929292  0006
2 kkda9a  0039
3 uyMMnn  8303

This effectively clusters, by fieldA and fieldB, the groups created by codeA and codeB. It doesn't produce my desired output, but since I can't find a better solution, it will have to do. My biggest gripe is that the plyr functions won't let me return more than one row per group (which makes complete sense), so I have to use a list as the result instead of a dataframe; that is not a real concern. The problem arises when the dataset is quite big (like this one), since plyr doesn't work very well with large datasets, and the alternative dplyr package is not compatible with list results. Oh well.
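For what it's worth, the list of named vectors can be reshaped into the desired codeA / codeB / Similar table with base R alone. The following is a minimal sketch, not a tested solution for the full 800k rows: the result and labels objects are hard-coded copies of the output printed above, and with a real dlply result the labels would presumably come from attr(result, "split_labels").

```r
# Per-group cluster assignments, copied from the dlply output shown above
result <- list(
  "929292.0006" = c("120" = 1, "3490" = 1),
  "kkda9a.0039" = c("2012" = 1, "9058" = 1, "9058" = 2),
  "uyMMnn.8303" = c("48848" = 1)
)
# Group labels, copied from the "split_labels" attribute shown above
labels <- data.frame(
  codeA = c("929292", "kkda9a", "uyMMnn"),
  codeB = c("0006", "0039", "8303"),
  stringsAsFactors = FALSE
)

rows <- lapply(seq_along(result), function(i) {
  res <- result[[i]]
  # Split the ids by cluster number and join each cluster's ids with ";"
  similar <- vapply(split(names(res), res), paste, character(1), collapse = ";")
  data.frame(
    codeA = labels$codeA[i],
    codeB = labels$codeB[i],
    Similar = unname(similar),
    stringsAsFactors = FALSE
  )
})

# One row per cluster, several rows per codeA-codeB group where needed
similar_rows <- do.call(rbind, rows)
print(similar_rows)
```

Because each group contributes its own small data frame, a group can yield more than one row, which is the part the list result was working around.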

Recommended answer

Create a new field called "codeAB" to partition the data according to the codeA-codeB match, like so:

data$codeAB <- factor(paste(data$codeA, data$codeB, sep = "-"))

Then cluster the rows within each of levels(data$codeAB) using the Damerau-Levenshtein distance. People seem to be suggesting that ELKI (http://en.wikipedia.org/wiki/ELKI) is good at clustering large collections of data without building a distance matrix.
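To illustrate why partitioning first helps, here is a rough sketch on the six example rows from the question. It builds the codeAB key, compares the number of cells in one global distance matrix with the total across per-group matrices, and computes one small distance matrix per level. Note the assumptions: base R's adist is used, which is plain Levenshtein rather than Damerau-Levenshtein (the stringdist package used in the question offers method = "dl" for that), and the variable names below are made up for the example.

```r
# The example rows from the question
data <- data.frame(
  id = c(120, 3490, 2012, 9058, 9058, 48848),
  fieldA = c("Similar one", "Similar oh", "CLOSE CAA", "CLASE CAC", "NON DONE", "OOO AD ADDD"),
  fieldB = c("addrs example1", "addrs example3", "addrs example10232",
             "addrs example01232", "addrs example010193", "addrs example18238"),
  codeA = c("929292", "929292", "kkda9a", "kkda9a", "kkda9a", "uyMMnn"),
  codeB = c("0006", "0006", "0039", "0039", "0039", "8303"),
  stringsAsFactors = FALSE
)

# Partition key: rows can only be similar if they share codeA and codeB
data$codeAB <- factor(paste(data$codeA, data$codeB, sep = "-"))

# Cells in one global distance matrix versus the sum over per-group matrices
group_sizes <- table(data$codeAB)
cells_global <- nrow(data)^2
cells_grouped <- sum(group_sizes^2)

# One small Levenshtein distance matrix per codeAB level
keys_by_group <- split(paste(data$fieldA, data$fieldB), data$codeAB)
dist_by_group <- lapply(keys_by_group, adist)
```

Each element of dist_by_group can then be passed to hclust(as.dist(...)) or another clustering routine; how much is saved on the real data depends on how large the biggest codeA-codeB group is.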

Someone was also asking about the D-L metric in ELKI: Clustering string data with ELKI

I hope that helps.
