Techniques for finding near duplicate records

Problem description

I'm attempting to clean up a database that, over the years, has acquired many duplicate records with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".

My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.

My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)

I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.

I have a few related questions:

  1. Is the tm package appropriate for this sort of task?

  2. Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance which is anecdotally slow.)

  3. Are there other suitable tools in R, apart from agrep and tm?

  4. Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)

Solution

If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.

I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme (also, as it stands, you can't use soundex() for big data sets with RecordLinkage).
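
A minimal sketch of what such a weighted scoring function might look like. The function name match_score and the three weights are hypothetical and only for illustration; the answer does not give its actual weighting scheme, so the weights would need tuning against hand-checked pairs.

```r
library(RecordLinkage)

# Hypothetical weighted score; the weights are illustrative, not the
# answer author's actual scheme.
match_score <- function(a, b, w_jw = 0.5, w_lev = 0.3, w_sdx = 0.2) {
  jw  <- jarowinkler(a, b)                      # Jaro-Winkler similarity in [0, 1]
  lev <- levenshteinSim(a, b)                   # 1 - edit distance / longer length
  sdx <- as.numeric(soundex(a) == soundex(b))   # 1 if the Soundex codes agree
  w_jw * jw + w_lev * lev + w_sdx * sdx
}

# All three functions are vectorised, so a and b can be whole columns.
match_score("some company ltd", "some company limited")
match_score("some company ltd", "another firm inc")
```

Because the weights sum to 1, the score stays between 0 and 1, which makes it easy to sort and threshold later.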

If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare that I write as separate tables in a MySQL database.
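
The preparation steps described above could be sketched in base R roughly as follows. The helper names normalise, first_word, and acronym are hypothetical, and the regular expression assumes plain ASCII names (accented letters would be dropped).

```r
# Lower-case, drop everything except ASCII letters and spaces,
# then squeeze repeated spaces and trim the ends.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  x <- gsub(" +", " ", x)
  trimws(x)
}

# First word of each (already normalised) name.
first_word <- function(x) sub(" .*$", "", x)

# Initial letter of every word, for comparing against acronyms in the other list.
acronym <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(w) paste(substr(w, 1, 1), collapse = ""),
         character(1))
}

raw   <- c("Some Company Limited", "SOME COMPANY LTD!",
           "International Business Machines")
clean <- normalise(raw)

# One data frame of comparison strings per list.
data.frame(name  = clean,
           first = first_word(clean),
           acro  = acronym(clean),
           stringsAsFactors = FALSE)
```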

So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe that's the first three letters in each list or the first three letters and the first three letters in the acronym). Then I calculate match scores using the above functions.
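
The answer does this join in MySQL. As an illustration of the same blocking idea without leaving R, here is a sketch that swaps in base R's merge() for the SQL join and adist() for the RecordLinkage scoring functions. The sample data and column names are made up.

```r
# Two already-cleaned lists (made-up sample data).
list_a <- data.frame(name_a = c("some company limited", "acme widgets",
                                "zeta holdings"),
                     stringsAsFactors = FALSE)
list_b <- data.frame(name_b = c("some company ltd", "acme widget co",
                                "somerset bakery"),
                     stringsAsFactors = FALSE)

# Blocking key: the first three letters of each name.
list_a$block <- substr(list_a$name_a, 1, 3)
list_b$block <- substr(list_b$name_b, 1, 3)

# all.x = TRUE keeps rows of list_a that have no partner,
# which mirrors a LEFT OUTER JOIN on the blocking key.
candidates <- merge(list_a, list_b, by = "block", all.x = TRUE)

# Stand-in score: 1 - Levenshtein distance / length of the longer string.
d <- mapply(function(a, b) adist(a, b)[1, 1],
            candidates$name_a, candidates$name_b, USE.NAMES = FALSE)
candidates$score <- 1 - d / pmax(nchar(candidates$name_a),
                                 nchar(candidates$name_b))

# Highest scores first; rows with no candidate (NA score) sort last.
candidates[order(candidates$score, decreasing = TRUE), ]
```

Only pairs that share a blocking key are ever scored, so the number of comparisons grows with the block sizes rather than with the product of the two list lengths.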

You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.
