Techniques for finding near duplicate records


Problem description

I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".

My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
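The normalization step described above can be sketched in base R. This is an illustrative sketch, not the asker's actual code; the `names_raw` vector and the single synonym rule are hypothetical.

```r
# Hypothetical input names, for illustration only
names_raw <- c("Some Company Limited", "SOME COMPANY LTD!")

normalize_name <- function(x) {
  x <- tolower(x)                      # convert to lower case
  x <- gsub("\\blimited\\b", "ltd", x) # replace a common synonym
  x <- gsub("[^a-z ]", "", x)          # strip non-alphabetic characters
  trimws(gsub(" +", " ", x))           # collapse repeated spaces
}

normalize_name(names_raw)
# both entries normalize to "some company ltd"
```

In practice the synonym substitutions would be a longer table of patterns applied in a loop, but the shape of the pipeline is the same.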

My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)

I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.

I have a few related questions:

  1. Is the tm package suitable for this sort of task?

  2. Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)

  3. Are there other suitable tools in R, apart from agrep and tm?

  4. Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)

Answer

If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
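As a minimal sketch of the starting point suggested here: `compare.dedup()` builds record pairs within a single data set and scores them. The `companies` data frame below is illustrative, not from the question.

```r
library(RecordLinkage)

# Hypothetical table of already-normalized names
companies <- data.frame(name = c("some company ltd",
                                 "some company ltd",
                                 "another firm inc"))

# compare.dedup() generates all record pairs within one data set;
# strcmp = TRUE scores string fields with a fuzzy comparator
# (Jaro-Winkler by default) instead of exact equality.
rpairs <- compare.dedup(companies, strcmp = TRUE)
rpairs$pairs   # one row per candidate pair, with similarity scores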

I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme. (Also, as it stands, you can't use soundex() for big data sets with RecordLinkage.)
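A hand-rolled scoring function in this spirit might look like the following. The weights are purely illustrative assumptions, not the answerer's actual scheme.

```r
library(RecordLinkage)

# Combine three similarity signals into one score; the 0.6/0.3/0.1
# weights are made up for this sketch.
match_score <- function(a, b) {
  0.6 * jarowinkler(a, b) +
  0.3 * levenshteinSim(a, b) +
  0.1 * (soundex(a) == soundex(b))   # 1 when phonetic codes agree
}

match_score("some company ltd", "some company limited")
```

All three functions are vectorized, so `match_score()` can score whole columns of candidate pairs at once.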

If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
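The first-word and acronym features described above could be built like this; the names and helper functions are hypothetical illustrations.

```r
# Hypothetical cleaned names
names_clean <- c("some company ltd", "international business machines")

# First word of each name, for extra weighting
first_word <- sapply(strsplit(names_clean, " "), `[`, 1)
first_word
# "some" "international"

# Acronym-ize: initial letter of each word, upper-cased
acronymize <- function(x) {
  sapply(strsplit(x, " "),
         function(w) paste(toupper(substr(w, 1, 1)), collapse = ""))
}
acronymize(names_clean)
# "SCL" "IBM"
```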

So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
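The same blocking idea can be sketched in R itself with `merge()`, without going through MySQL. Here the blocking key is the first three letters of each cleaned name; the two data frames are hypothetical.

```r
# Hypothetical name lists to link
list_a <- data.frame(id_a = 1:2,
                     name_a = c("some company ltd", "acme corp"))
list_b <- data.frame(id_b = 1:2,
                     name_b = c("some company limited", "beta llc"))

# Blocking key: first three letters of each name
list_a$key <- substr(list_a$name_a, 1, 3)
list_b$key <- substr(list_b$name_b, 1, 3)

# all.x = TRUE gives LEFT OUTER JOIN behaviour: every row of list_a
# survives, paired only with list_b rows sharing its blocking key.
candidates <- merge(list_a, list_b, by = "key", all.x = TRUE)
```

Only pairs that share a blocking key are scored afterwards, which keeps the candidate set far smaller than the full cross product.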

You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
