Finding Matches in Two Large Lists of Strings

Views: 102 | Posted: 2019/6/22 15:47:28

Problem description

Hi,

I have two long (>50000) lists of names that must be periodically checked for possible matches between them. I've coded up several fast algorithms for computing edit distances (e.g. Levenshtein), but the size of the lists still makes the comparison very time-consuming.

Is there any fast function F(S) that people compute offline on single strings and then cluster by, so that two strings far apart in F are also far apart in string distance, and one could reduce the size of the set that must be checked exactly? For example, if my criterion for matching is Leven(s1, s2) < N, then I know that any matching pair has |Length(s1) - Length(s2)| < N, so if I find a length difference >= N I won't even bother running Levenshtein.
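
To make the length filter above concrete, here is a minimal Python sketch. It is only an illustration of the idea described in the question, not code from the original post: the names `levenshtein` and `find_matches` and the parameter `n` (standing in for N) are assumptions, and the edit distance is the textbook dynamic-programming version rather than one of the faster algorithms mentioned. The second list is bucketed by length once (the "offline" F(S)), and each name in the first list is compared only against buckets whose length differs by less than N.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def find_matches(names_a, names_b, n):
    """Yield (s1, s2, distance) for pairs with levenshtein(s1, s2) < n."""
    # Offline step (hypothetical F(S) = len(S)): bucket one list by length.
    by_length = defaultdict(list)
    for s in names_b:
        by_length[len(s)].append(s)

    for s1 in names_a:
        # Only lengths with |len(s1) - len(s2)| < n can possibly match,
        # because the edit distance is never smaller than the length gap.
        for length in range(max(0, len(s1) - n + 1), len(s1) + n):
            for s2 in by_length.get(length, ()):
                d = levenshtein(s1, s2)
                if d < n:
                    yield s1, s2, d
```

Calling `list(find_matches(list_a, list_b, 2))` would then run the exact distance only on pairs whose lengths differ by at most 1. How much work this saves depends on how spread out the name lengths are; with names of similar length it prunes very little, which is the limitation the question is getting at.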

I Googled once and came up with suggestions to use Hilbert curves or Z-ordering, and it made my head hurt. But there's gotta be something better than just length...

Thanks a million!
