Imperfect string match using data.table in R

Views: 635 · Posted: 2017/3/12 12:44:54 · Tags: string, r, performance, text, data.table


Problem description

OK, so I posted a question a while back about writing an R function to speed up string matching on large text files. I had my eyes opened to 'data.table', and my question was answered perfectly.

This is the link to that thread which includes all of the data and details: 

Accelerate performance and speed of string match in R

But now I am running into another problem. Once in a while, the submitted VIN#s (in the 'vinDB' file) differ by one or two characters from those in the 'carFile' file, due to human error when people fill out their car info at the DMV. Is there a way to edit the

    dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

line of that code (provided by @BrodieG in the above link) so that it also recognizes VIN#s that differ by one or two characters?

I apologize if this is an easy correction. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much as I can about its utility, and the knowledgeable members of this forum have been absolutely pivotal to me.

EDIT:

So I have been playing around with the 'lapply' and 'agrep' functions as suggested, and I must be doing something wrong.

I tried replacing this line:

    dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

with this:

    dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]

But got the following error:

    Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins,  :
    x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
    Character columns must join to factor or character columns.

But they are both type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, i.e., am I using lapply correctly here?

Thanks!
Solution
I finally got it. 

The 'agrep' function has a 'value' option that needs to be changed from FALSE (the default) to TRUE, so that it returns the matching strings rather than their integer positions:

    dt <- dt[lapply(car.vins, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), vin.names]
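To see why 'value' matters for the join error in the question, here is a minimal sketch using base R only. The VINs below are made-up sample values, not data from the original thread, and max.distance = 2 is taken from the question rather than from the final answer.

```r
# Hypothetical sample data (assumption: not from the original thread).
vin.vins <- c("1HGCM82633A004352", "JH4KA8260MC000000", "1FTFW1EF1EFA00001")
car.vin  <- "1HGCM82633A004353"  # last character differs from vin.vins[1]

# Default value = FALSE: agrep returns integer positions of the matches in x.
# Passing these into a join against a character key is what triggers the
# "character column being joined to ... type 'integer'" error.
idx <- agrep(car.vin, vin.vins, max.distance = 2)
class(idx)   # "integer"

# value = TRUE: agrep returns the matching strings themselves,
# which have the same type as the character key column.
hits <- agrep(car.vin, vin.vins, max.distance = 2, value = TRUE)
class(hits)  # "character"
```

With value = TRUE the result can be compared or joined against a character column directly, which is the change the answer relies on.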
Note: the max.distance components can be adjusted to bound the overall cost (generalized Levenshtein distance) or the number of insertions, deletions, and substitutions separately. 'agrep' is a fascinating function!
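As a rough illustration of the count this thread is after, the sketch below tallies fuzzy matches per name with 'agrepl' inside a grouped data.table query. This is an alternative formulation, not the join used in the answer above. The table contents, the names in vin.names, and the sample VINs are all hypothetical, and max.distance is written in the list form that the agrep documentation describes.

```r
library(data.table)

# Hypothetical stand-ins for the vinDB table and the carFile VINs.
dt <- data.table(
  vin.names = c("alice", "bob", "carol"),
  vin.vins  = c("1HGCM82633A004352", "JH4KA8260MC000000", "1FTFW1EF1EFA00001")
)
car.vins <- c(
  "1HGCM82633A004353",  # one character off alice's VIN
  "1HGCM82633A004352",  # exact copy of alice's VIN
  "JH4KB8260MD000000",  # two characters off bob's VIN
  "5YJSA1E26HF000000"   # not close to any VIN in dt
)

# Allow at most 2 edits in total; insertions, deletions and substitutions
# could also be bounded individually via further list components.
max.dist <- list(cost = 2, all = 2)

# Group by name and VIN so that vin.vins is a single string in each group,
# then count how many carFile VINs approximately match it.
res <- dt[, .(NumTimesFound = sum(agrepl(vin.vins, car.vins,
                                         max.distance = max.dist))),
          by = .(vin.names, vin.vins)]
print(res)
```

Note that this calls agrepl once per row of dt, so on large files it will be far slower than the keyed exact join from the earlier thread; it is meant to show the matching logic, not to be a tuned solution.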

Thanks again for all the help! 

