Fuzzy matching deduplication in less than exponential time?

Problem Description

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).

I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.

The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).
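
As the P.S. below anticipates, the all-pairs workload is quadratic rather than exponential in the number of records, but the practical point stands: the quoted figure is just the handshake count n(n-1)/2. A quick sanity check of the arithmetic (Python used purely for illustration):

```python
# Rough comparison counts for n = 1,000,000 records.
n = 1_000_000
single_query = n                 # one query string vs. every record
all_pairs = n * (n - 1) // 2     # every record vs. every other record
print(f"{single_query:,} vs {all_pairs:,}")  # 1,000,000 vs 499,999,500,000 (~5 x 10^11)
```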

I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, grouping strings that had roughly equal similarity measures, and then running the brute-force method within those groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B, because their similarity to string C (the generated check-string) can be very different even though A and B are very similar to each other.
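
A minimal sketch of that check-string idea, assuming difflib.SequenceMatcher as a stand-in similarity measure and an illustrative band width; any real metric (Levenshtein, Jaro-Winkler, ...) could be substituted:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in similarity measure; swap in Levenshtein, Jaro-Winkler, etc.
    return SequenceMatcher(None, a, b).ratio()

def bucket_by_check_string(records, check_string, band_width=0.05):
    # Group records whose similarity to the check string falls in the same band.
    buckets = defaultdict(list)
    for rec in records:
        band = int(similarity(rec, check_string) / band_width)
        buckets[band].append(rec)
    return buckets

def flag_pairs(buckets, threshold=0.85):
    # Brute-force only within each band (neighbouring bands could be added
    # to reduce the misses described above).
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                if similarity(members[i], members[j]) >= threshold:
                    yield members[i], members[j]
```

As noted, two near-duplicates can still land in different bands because they sit at different distances from the check string; that weakness is exactly what the LSH-style blocking in the answer below is designed to avoid.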

Any thoughts?

P.S. I realize I may have used the wrong terms for time complexity - it's a concept I have a basic grasp of, but not well enough to drop an algorithm into the right category on the spot. If I've used the terms incorrectly, I welcome corrections, but I hope I at least got my point across.

Edit

Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.

Recommended Answer

Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. So STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE, ..., RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
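
A minimal sketch of that blocking step, assuming 5-character n-grams and Python's difflib as the fuzzy matcher; the n-gram length and the 0.6 threshold are illustrative choices, not prescribed by the answer:

```python
from collections import defaultdict
from difflib import SequenceMatcher

def ngrams(text, n=5):
    text = text.upper()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_pairs(records, n=5):
    # Index every record id under each of its n-grams, then emit only the
    # pairs that share at least one n-gram; only these pairs ever reach the
    # (expensive) fuzzy matcher.
    index = defaultdict(set)
    for rec_id, text in enumerate(records):
        for gram in ngrams(text, n):
            index[gram].add(rec_id)
    pairs = set()
    for bucket in index.values():
        ids = sorted(bucket)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

records = ["STACKOVERFLOW", "SXACKOVRVLOX", "123 MAIN ST"]
for a, b in candidate_pairs(records):
    score = SequenceMatcher(None, records[a], records[b]).ratio()
    if score >= 0.6:
        print(records[a], "~", records[b], round(score, 2))
# STACKOVERFLOW and SXACKOVRVLOX collide on the shared n-gram ACKOV;
# "123 MAIN ST" shares no n-gram with either, so it is never compared.
```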

A next level up in sophistication is to pick a random hash function - e.g. HMAC with an arbitrary key - and, of the n-grams you find, keep only the one with the smallest hashed value. Then you have fewer n-grams to keep track of, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
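
A minimal MinHash sketch along those lines, with seeded SHA-1 standing in for the keyed HMAC the answer mentions; the number of hash functions (4) and the n-gram length (4) are arbitrary choices for illustration:

```python
import hashlib
from collections import defaultdict

def ngrams(text, n=4):
    text = text.upper()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=4, n=4):
    # One value per (seeded) hash function: hash every n-gram and keep only
    # the smallest digest. Strings sharing many n-grams are likely to agree
    # on these minima.
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            hashlib.sha1(f"{seed}:{gram}".encode()).hexdigest()
            for gram in ngrams(text, n)
        )
        signature.append(smallest)
    return tuple(signature)

def candidate_groups(records, num_hashes=4, n=4):
    # Records whose whole signatures agree become candidates for the fuzzy
    # matcher; requiring fewer agreeing hashes trades precision for recall.
    groups = defaultdict(list)
    for rec_id, text in enumerate(records):
        groups[minhash_signature(text, num_hashes, n)].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Each record is hashed once, so a pass over the full 60+ million row table stays roughly linear; only the small candidate groups ever get the brute-force fuzzy comparison.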
