Fuzzy matching deduplication in less than exponential time?


Problem Description



I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).

I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.

The former would be a linear-time problem (comparing one value against a million other values, calculating some similarity measure each time). The latter is a quadratic-time problem (compare every record's value against every other record's value; for a million records, that's n(n-1)/2 ≈ 5 x 10^11 calculations vs. the 1,000,000 calculations for the former option).
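For concreteness, a minimal sketch of that brute-force pass (the sample records, the threshold, and the use of Python's difflib ratio as the similarity measure are illustrative assumptions, not anything prescribed above):

```python
# Brute-force pairwise pass: every record against every other, n(n-1)/2 comparisons.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    # Any string-similarity measure could go here; difflib's ratio is just a stand-in.
    return SequenceMatcher(None, a, b).ratio()

def brute_force_duplicates(records, threshold=0.8):
    # Yields index pairs whose strings look like fuzzy duplicates of each other.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            yield i, j

records = ["12 Main St", "12 Main Street", "99 Elm Ave", "99 Elm Avenue"]
print(list(brute_force_duplicates(records)))  # 4 records already cost 6 comparisons
```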

I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, grouping the strings that had roughly equal similarity measures, and then running the brute-force method within those groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B, because their similarity to string C (the generated check-string) could be very different even though A and B are very similar to each other.
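A rough sketch of that check-string idea, with the same caveat that near-duplicates can land in different buckets (the pivot string, bucket width, and threshold below are arbitrary choices made up for illustration):

```python
# Group records by their similarity to one fixed check-string, then brute-force
# only within each bucket. Two strings that are similar to each other can still
# score differently against the pivot and end up in different buckets.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def group_by_check_string(records, pivot="1 Example Street", bucket_width=0.05):
    buckets = defaultdict(list)
    for i, text in enumerate(records):
        score = SequenceMatcher(None, pivot, text).ratio()
        buckets[int(score / bucket_width)].append(i)   # quantize the similarity score
    return buckets

def flag_within_buckets(records, threshold=0.8):
    flagged = []
    for members in group_by_check_string(records).values():
        for i, j in combinations(members, 2):          # brute force, but per bucket
            if SequenceMatcher(None, records[i], records[j]).ratio() >= threshold:
                flagged.append((i, j))
    return flagged
```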

Any ideas?

P.S. I realize I may have used the wrong terms for time complexity - it is a concept I have a basic grasp of, but not well enough to drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.

Edit

Some commenters have asked, given fuzzy matches between records, what my strategy is for choosing which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked as the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment. It is okay if there are some false positives, as long as the amount is roughly predictable / consistent. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass takes a month to run, it isn't even an option in the first place.

Solution

Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams, so STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE..., RFLOW}. Then use a large hash table or a sort-merge to find colliding n-grams, and check the collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide, because both are associated with the colliding n-gram ACKOV.
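A minimal sketch of that n-gram blocking step, assuming a 5-gram length, difflib as the final fuzzy check, and a threshold picked purely for illustration (the answer describes the idea, not this exact code):

```python
# n-gram blocking: index every record under each of its overlapping n-grams,
# then run the expensive fuzzy comparison only on records that share a block.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def ngrams(text, n=5):
    text = text.upper()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def candidate_pairs(records, n=5):
    index = defaultdict(set)                 # n-gram -> ids of records containing it
    for rid, text in enumerate(records):
        for gram in ngrams(text, n):
            index[gram].add(rid)
    pairs = set()
    for ids in index.values():               # every bucket of colliding n-grams
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def fuzzy_duplicates(records, threshold=0.8, n=5):
    return [(i, j) for i, j in sorted(candidate_pairs(records, n))
            if SequenceMatcher(None, records[i], records[j]).ratio() >= threshold]

# STACKOVERFLOW and SXACKOVRVLOX become a candidate pair via the shared 5-gram ACKOV.
print(candidate_pairs(["STACKOVERFLOW", "SXACKOVRVLOX", "UNRELATED ROW"]))
```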

A next level up in sophistication is to pick a random hash function - e.g. HMAC with an arbitrary key - and, of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but you will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is make n quite small and get higher precision by concatenating the results from more than one hash function for the same record, so that you need a match under multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
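A compressed sketch of that MinHash idea; HMAC-SHA1, the four arbitrary keys, and the 4-gram length are illustrative assumptions, and records are bucketed by the full concatenated signature, i.e. they must agree under every hash function at once:

```python
# MinHash-style signatures: hash every n-gram of a record with several keyed
# hash functions (HMAC-SHA1 here) and keep only the minimum value per key.
# Records become candidates only if their whole concatenated signatures match.
import hashlib
import hmac
from collections import defaultdict
from itertools import combinations

KEYS = [b"key-0", b"key-1", b"key-2", b"key-3"]   # one arbitrary key per hash function

def ngrams(text, n=4):
    text = text.upper()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def signature(text, n=4):
    grams = ngrams(text, n)
    return tuple(min(hmac.new(key, gram.encode(), hashlib.sha1).digest()
                     for gram in grams)
                 for key in KEYS)

def candidate_pairs(records):
    buckets = defaultdict(list)                   # identical signature -> record ids
    for rid, text in enumerate(records):
        buckets[signature(text)].append(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs                                  # verify with a fuzzy matcher as before
```

In practice people usually keep several such concatenated signatures per record (LSH "bands"), so that near-duplicates get more than one chance to collide.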
