Finding Matches in Two Large Lists of Strings


Problem description



Hi,

I have two long (>50000) lists of names that must be periodically checked for possible matches between them. I've coded up several fast algorithms for computing edit distances (e.g. Levenshtein), but the size of the lists still makes the comparison very time-consuming.

Is there any fast function F(S) that people compute offline on single strings and cluster them by, so that two strings far apart in F are also far apart in string distance, and one could reduce the size of the set that must be checked exactly? For example, if my criterion for matching is Leven(s1, s2) < N, then I know that |Length(s1) - Length(s2)| < N, so if I find a length difference >= N I won't even bother running Levenshtein.
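The length filter described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code from the question: the names `levenshtein` and `candidate_matches`, and the idea of bucketing one list by string length, are introduced here. It relies on the fact that the length difference is a lower bound on the Levenshtein distance, so only buckets with |Length(s1) - Length(s2)| < N are ever compared.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def candidate_matches(list_a, list_b, n):
    """Yield (s1, s2, distance) for pairs with distance < n.

    Hypothetical helper: pairs whose lengths differ by n or more are
    never passed to levenshtein(), because they cannot match.
    """
    by_length = defaultdict(list)
    for s in list_b:
        by_length[len(s)].append(s)

    for s1 in list_a:
        # Only lengths L with |len(s1) - L| < n can contain a match.
        for length in range(max(0, len(s1) - n + 1), len(s1) + n):
            for s2 in by_length.get(length, ()):
                d = levenshtein(s1, s2)
                if d < n:
                    yield s1, s2, d
```

Calling `list(candidate_matches(names_a, names_b, 2))` on two hypothetical lists would compare each name only against names whose length differs by at most one character.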

I Googled once and came up with suggestions to use Hilbert curves or Z-ordering, and it made my head hurt. But there's gotta be something better than just length...

Thanks a million!

Recommended answer

There's a much simpler solution: Use a hash table.

Insert all strings from the larger list into a hash table with at least twice as many slots as strings. Then check to see if each string from the smaller list is in the hash table.

Hash tables trade space for speed and are very fast: a lookup takes constant time on average. If you want more speed, just increase the number of slots in the table, which reduces collisions.
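A minimal Python sketch of this approach is shown below. The function name `exact_matches` is an assumption made for illustration. Python's built-in `set` is a hash table that resizes itself, so the slot count is not chosen by hand here. Note that a hash lookup finds only strings that are exactly equal; it does not cover the edit-distance criterion from the question.

```python
def exact_matches(list_a, list_b):
    """Return the strings of the smaller list that also occur in the larger list."""
    # Build the hash table from the larger list, then probe it with the smaller one.
    if len(list_a) >= len(list_b):
        larger, smaller = list_a, list_b
    else:
        larger, smaller = list_b, list_a
    table = set(larger)  # hash table of all strings in the larger list
    return [s for s in smaller if s in table]
```

Each membership test is a single average-case constant-time lookup, so the whole check is roughly linear in the combined size of the two lists.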

