Imperfect string match using data.table

Views: 10 | Published: 2022/1/13 19:35:44 | Tags: string, r, performance, text, data.table


Question

OK, so a while back I posted a question about writing an R function to speed up string matching on large text files. I had my eyes opened to 'data.table', and my question was answered perfectly.

This is the link to that thread, which includes all of the data and details:

Accelerate performance and speed of string match in R

But now I am running into another problem. Once in a while, the submitted VIN#s (in the 'vinDB' file) differ by one or two characters from those in the 'carFile' file, due to human error when people fill out their car info at the DMV. Is there a way to edit the

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

line of that code (provided by @BrodieG in the above link) so that it recognizes VIN#s that differ by one or two characters?

I apologize if this is an easy correction. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much as I can about its utility, and the knowledgeable members of this forum have been absolutely pivotal to me.

So I have been playing around with the 'lapply' and 'agrep' functions as suggested, and I must be doing something wrong:

I tried replacing this line:

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

with this:

dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]

but I get the following error:

Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins,  : 
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'. 
Character columns must join to factor or character columns.

But they are both of type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, i.e., am I using lapply correctly here?

Thanks!

Recommended Answer

I finally figured it out.

The agrep function has a value argument that needs to be changed from FALSE (the default) to TRUE:

# For each VIN in car.vins, return the matching strings (not their positions) from vin.vins.
dt <- dt[lapply(car.vins, agrep, x = vin.vins, max.distance = list(cost = 2, all = 2), value = TRUE)
         , .(NumTimesFound = .N)
         , by = vin.names]
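
To see why the value argument matters: with the default value = FALSE, agrep returns integer positions rather than strings, which is probably what the "type 'integer'" join error in the question is complaining about. Here is a minimal base-R sketch of the difference. The VINs below are made up for illustration and are not the original 'vinDB' or 'carFile' data, and the per-VIN count uses a plain data.frame rather than the data.table join above.

```r
# Hypothetical data: VINs on file and VINs as submitted (not the original files).
car.vins <- c("1HGCM82633A004352", "JH4KA8260MC000000", "2T1BURHE0JC123456")
vin.vins <- c("1HGCM82633A004353",  # one character off from car.vins[1]
              "2T1BURHE0JC123456",  # exact match for car.vins[3]
              "WBA3A5C50CF256789")  # not on file

# Default value = FALSE: each result holds integer positions within car.vins.
hits_idx <- lapply(vin.vins, function(v) agrep(v, car.vins, max.distance = 2))

# value = TRUE: each result holds the matching strings themselves.
hits_chr <- lapply(vin.vins, function(v) {
  agrep(v, car.vins, max.distance = 2, value = TRUE)
})

# Number of approximate matches found for each submitted VIN.
counts <- data.frame(vin = vin.vins, NumTimesFound = lengths(hits_chr))
print(counts)
```

The integer results in hits_idx are fine for subsetting (for example car.vins[hits_idx[[1]]]), but they cannot be joined against a character key, whereas the strings in hits_chr can.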

Note: the max.distance argument can be tuned to bound the overall match cost (a generalized Levenshtein distance) or the number of insertions, deletions, and substitutions separately. 'agrep' is a fascinating function!
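
As a rough illustration of how the distance bound changes what gets matched, the sketch below compares one VIN against a few hand-made variants. The strings are invented for this example, and adist (from the utils package) is brought in here only to display the edit distances; it is not part of the original answer.

```r
# Hypothetical VIN and variants, made up for illustration.
pattern <- "1HGCM82633A004352"
candidates <- c("1HGCM82633A004352",  # identical
                "1HGCM82633A004353",  # one substitution
                "1HGCM82633A04352",   # one character missing
                "1HGCM82633A004399",  # two substitutions
                "1HGCM82633B004399")  # three substitutions

# Levenshtein distance from the pattern to each whole candidate string.
d <- as.vector(adist(pattern, candidates))
print(d)

# Approximate matches allowing a total cost of at most 1 edit, then at most 2.
within_1 <- agrep(pattern, candidates, max.distance = 1, value = TRUE)
within_2 <- agrep(pattern, candidates, max.distance = 2, value = TRUE)
print(within_1)
print(within_2)
```

One design point to keep in mind: agrep looks for the pattern approximately matching anywhere within each string, while adist compares whole strings, so the two can disagree when the strings differ a lot in length. For fixed-length identifiers such as VINs the difference should rarely matter.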

Thanks again for everyone's help!
