Imperfect string match using data.table
Question
OK, so a while back I posted a question about writing an R function to speed up string matching on large text files. I had my eyes opened to 'data.table', and my question was answered perfectly.
This is the link to that thread, which includes all of the data and details:
But now I am running into another problem. Once in a while, the submitted VIN#s (in the 'vinDB' file) differ by one or two characters from those in the 'carFile' file, due to human error when people fill out their car info at the DMV. Is there a way to edit the
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]
line of that code (provided by @BrodieG in the above link) so that it recognizes VIN#s that differ by one or two characters?
I apologize if this is an easy correction. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much as I can about what it can do, and the knowledgeable members of this forum have been absolutely pivotal to me.
Edit:
So I have been playing around with the 'lapply' and 'agrep' functions as suggested, and I must be doing something wrong:
I tried replacing this line:
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]
with this:
dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]
but I get the following error:
Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins, :
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
Character columns must join to factor or character columns.
But they are both type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, i.e., am I using lapply correctly here?
Thanks!
Recommended answer
I finally figured it out!
The agrep function has a value option that needs to be changed from FALSE (the default) to TRUE:
dt <- dt[lapply(car.vins, agrep, x = vin.vins, max.distance = c(cost=2, all=2), value = TRUE)
, .(NumTimesFound = .N)
, by = vin.names]
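The join error in the question is consistent with agrep's default return type: with value = FALSE it returns integer positions, while value = TRUE returns the matching strings. Here is a minimal base R sketch of that difference. The VINs are made up for illustration and are not taken from the original thread.

```r
# Hypothetical VINs for illustration only; not the poster's data.
vin <- "1HGCM82633A004352"
car.vins <- c("1HGCM82633A004352",  # exact copy
              "1HGCM82633A0O4352",  # one character differs (0 typed as O)
              "5YJSA1E26HF000001")  # unrelated VIN

# Default (value = FALSE): integer positions of the matching elements.
idx <- agrep(vin, car.vins, max.distance = 2)

# value = TRUE: the matching strings themselves.
hits <- agrep(vin, car.vins, max.distance = 2, value = TRUE)

str(idx)   # an integer vector
str(hits)  # a character vector
```

Passing the integer form as the join argument would pair an integer column with the character key, which matches the wording of the error message above.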
Note: the max.distance argument can be tuned. According to the agrep documentation, it accepts an integer, a fraction of the pattern length, or a list with components such as cost, all, insertions, deletions and substitutions. 'agrep' is a fascinating function!
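To see how a distance threshold translates into counts, the sketch below tallies, for each submitted VIN, how many entries fall within a distance of 2. It uses made-up VINs and base R only, and the variable names agrep.counts and adist.counts are hypothetical. Because agrep looks for approximate matches within each string, the sketch also computes whole-string Levenshtein distances with adist as a cross-check. That cross-check is an addition here, not part of the original answer.

```r
# Hypothetical VINs for illustration only; not the poster's data.
vin.vins <- c("1HGCM82633A004352", "JH4KA8260MC000000")
car.vins <- c("1HGCM82633A004352",  # exact copy of the first VIN
              "1HGCM82633A0O4352",  # one substitution
              "1HGCM82633AOO4352",  # two substitutions
              "JH4KA8260MC000000",  # exact copy of the second VIN
              "5YJSA1E26HF000001")  # unrelated VIN

# Per submitted VIN, count the entries agrep accepts within a distance of 2.
agrep.counts <- vapply(
  vin.vins,
  function(p) length(agrep(p, car.vins, max.distance = 2)),
  integer(1)
)

# Whole-string Levenshtein distances: rows are vin.vins, columns are car.vins.
d <- adist(vin.vins, car.vins)
adist.counts <- rowSums(d <= 2)

print(agrep.counts)
print(adist.counts)
```

The two counts agree for these equal-length strings. They could diverge when a short pattern appears inside a longer string, since agrep would still report a match there.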
Thanks again for all the help!