Fuzzy matching deduplication in less than exponential time?


Problem Description

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).

I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.

The former would be a linear-time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is a quadratic-time problem (comparing every record's value against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs. the 1,000,000 calculations for the former option).

I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, then grouping strings that had roughly equal similarity measures, and then running the brute-force method within these groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B because their respective similarities to string C (the generated check-string) are very different, even though A and B are very similar to each other.
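For illustration, a minimal sketch of that check-string idea (Python; the use of difflib.SequenceMatcher as the similarity measure, the particular check string, and the bucket width are all arbitrary placeholders, not a recommendation):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def bucketed_candidate_pairs(records, check_string="123 MAIN STREET", width=0.05):
    """Bucket records by their (rounded) similarity to a single check string,
    then only brute-force compare records that landed in the same bucket."""
    buckets = defaultdict(list)
    for rid, text in enumerate(records):
        score = SequenceMatcher(None, text.upper(), check_string).ratio()
        buckets[int(score / width)].append(rid)

    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

As noted, two records that are near-duplicates of each other can still land in different buckets if their similarities to the check string straddle a bucket boundary, so this only shows the shape of the idea, not a fix for that weakness.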

Any thoughts?

P.S. I realize I may be using the wrong terminology for time complexity. It's a concept I have a basic grasp of, but not well enough to place an algorithm in the correct category on the spot. If I've used the terms incorrectly, I welcome corrections, but hopefully I'm at least getting my meaning across.

EDIT

Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.

Recommended Answer

Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams, so STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE... , RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check the collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
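As a rough sketch of this blocking scheme (Python; the 5-character gram size, the crude whitespace normalization, difflib.SequenceMatcher as a stand-in fuzzy matcher, and the 0.7 threshold are all assumptions made here for illustration):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def ngrams(text, n=5):
    """Overlapping character n-grams of a crudely normalized string."""
    s = text.upper().replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def colliding_pairs(records, n=5):
    """Index record ids by n-gram; only ids sharing at least one n-gram become candidates."""
    index = defaultdict(list)
    for rid, text in enumerate(records):
        for gram in ngrams(text, n):
            index[gram].append(rid)
    pairs = set()
    for ids in index.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

def flag_duplicates(records, threshold=0.7):
    """Run the expensive fuzzy comparison only on colliding candidate pairs."""
    flagged = []
    for a, b in colliding_pairs(records):
        if SequenceMatcher(None, records[a], records[b]).ratio() >= threshold:
            flagged.append((records[a], records[b]))
    return flagged

# STACKOVERFLOW and SXACKOVRVLOX collide on the shared n-gram ACKOV, as described above.
print(flag_duplicates(["STACKOVERFLOW", "SXACKOVRVLOX", "742 Evergreen Terrace"]))
```

At the scale of millions of rows, the per-gram buckets would more likely be built with a sort-merge or a database GROUP BY than an in-memory dict, and very common n-grams would probably need to be capped or dropped to keep the candidate pair count manageable.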

A next level up in sophistication is to pick a random hash function, e.g. HMAC with an arbitrary key, and of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need a match in multiple different hash functions at the same time; I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
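A hedged sketch of that minhash variant (Python; choosing hashlib.blake2b with a distinct per-function key as the keyed hash, 3-character grams, and four hash functions are illustrative assumptions, not something the answer prescribes):

```python
import hashlib
from collections import defaultdict

def ngrams(text, n=3):
    s = text.upper().replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(text, num_hashes=4, n=3):
    """For each keyed hash function, keep only the n-gram with the smallest hash value.
    Assumes every record is at least n characters long."""
    signature = []
    for k in range(num_hashes):
        hash_key = k.to_bytes(4, "big")  # a distinct key per hash function
        smallest = min(
            ngrams(text, n),
            key=lambda gram: hashlib.blake2b(gram.encode(), key=hash_key).digest(),
        )
        signature.append(smallest)
    return tuple(signature)

def minhash_candidate_pairs(records, num_hashes=4, n=3):
    """Two records collide only if their concatenated signatures agree in every hash function."""
    buckets = defaultdict(list)
    for rid, text in enumerate(records):
        buckets[minhash_signature(text, num_hashes, n)].append(rid)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Requiring agreement in every hash function raises precision at the cost of recall; the usual MinHash/LSH refinement is to build several such concatenated signatures per record and treat a match in any one of them as a candidate.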
