计算字符串的相似分数时,样本量较大的有效方法是什么? [英] Efficient way of calculating likeness scores of strings when sample size is large?

查看:109
本文介绍了计算字符串的相似分数时,样本量较大的有效方法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设你有10000电子邮件地址的列表,你想查找一些在此列表中最接近的邻居是 - 定义为是可疑接近其他电子邮件地址在列表中的电子邮件地址

Let's say that you have a list of 10,000 email addresses, and you'd like to find what some of the closest "neighbors" in this list are - defined as email addresses that are suspiciously close to other email addresses in your list.

我知道如何计算两个字符串之间的 Levenshtein距离(感谢到<一个href="http://stackoverflow.com/questions/797688/how-to-determine-a-strings-dna-for-likeness-to-another">this问题),这将给我需要多少操作,将一种字符串转换成另一种得分。

I'm aware of how to calculate the Levenshtein distance between two strings (thanks to this question), which will give me a score of how many operations are needed to transform one string into another.

让我们说,我定义可疑接近另一个电子邮件地址为具有莱文斯坦两串得分小于N。

Let's say that I define "suspiciously close to another email address" as two strings having a Levenshtein score less than N.

有没有更有效的方式来找到对的字符串,其得分比除每一个可能比较字符串列表中的其他所有可能的字符串,这个阈值吗?换句话说,可这类问题比解决更快为O(n ^ 2)

Is there a more efficient way to find pairs of strings whose score is lower than this threshold besides comparing every possible string to every other possible string in the list? In other words, can this type of problem be solved quicker than O(n^2)?

时的莱文斯坦得分算法选择不当这个问题?

Is Levenshtein score a poor choice of algorithms for this problem?

推荐答案

修改通过j_random_hacker:不幸的是,IM pressive O(log n)的下方提出申索是不正确。该BK73论文被引用在链接的文章证明了O(log n)的查找时间只为0级(精确匹配)查询。他们不试图分析近似匹配查询,只给一些实验结果。而B-YN98文中还引用链接的文章提供了在其附录中的分析显示,对于非精确查询,BK-树是为O(n ^ a)有些0℃ A&LT; 1这是任何一对对象之间的预期距离的函数 - 再次的不是的O(log n)的。 (我通常不这样做,但这是接受和最高投票的答案,我的意见解释情况给笔者现在已经忽略了两个星期。BK-树是不一样神奇的作者声称,但他们仍然是一个有趣和有价值的技术字符串匹配。)

EDIT by j_random_hacker: Unfortunately, the impressive O(log n) claim made below is incorrect. The BK73 paper cited in the linked article proves an O(log n) lookup time only for "Class 0" (exact-match) queries. They don't attempt to analyse approximate-match queries, only giving some experimental results. The B-YN98 paper also cited in the linked article gives an analysis in its appendix showing that for non-exact queries, BK-trees are O(n^a) for some 0 < a < 1 which is a function of the expected distance between any pair of objects -- again, not O(log n). (I don't normally do this, but this is the accepted and highest-voted answer, and my comments explaining the situation to the author have now been ignored for two weeks. BK-Trees aren't as magical as the author claims, but they are still an interesting and worthwhile technique for string matching.)

是的 - 你可以找到一个字符串为O一给定距离内的所有字符串使用的 BK-树。包括生成每个字符串距离N备用解决方案可能会更快为1 Levenshtein距离,但工作量迅速气球失控更长的距离。

Yup - you can find all strings within a given distance of a string in O(log n) time by using a BK-Tree. Alternate solutions involving generating every string with distance n may be faster for a levenshtein distance of 1, but the amount of work rapidly balloons out of control for longer distances.

这篇关于计算字符串的相似分数时,样本量较大的有效方法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆