提高模糊字符串匹配字典的性能 [英] Improving performance of fuzzy string matching against a dictionary

查看:114
本文介绍了提高模糊字符串匹配字典的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以我正在使用 SecondString 进行模糊字符串匹配,其中我有一个大字典进行比较(字典中的每个条目都有相关联的非唯一标识符)。我目前正在使用一个hashMap来存储这个字典。

So I'm currently working for with using SecondString for fuzzy string matching, where I have a large dictionary to compare to (with each entry in the dictionary has an associated non-unique identifier). I am currently using a hashMap to store this dictionary.

当我想做模糊字符串匹配时,我首先检查字符串是否在hashMap中,然后迭代所有其他潜在的键,计算字符串相似度并存储具有最高相似度的k,v对/ s。根据我使用的字典可能需要很长时间(12330 - 1800035条目)。有什么办法加快速度吗?我目前正在写一个记忆功能/表格,作为一种加快速度的方法,但任何人都可以想到一个更好的方式来提高这个速度吗?也许是一个不同的结构或其他我错过的东西。

When I want to do fuzzy string matching, I first check to see if the string is in the hashMap and then I iterate through all of the other potential keys, calculating the string similarity and storing the k,v pair/s with the highest similarity. Depending on which dictionary I am using this can take a long time ( 12330 - 1800035 entries ). Is there any way to speed this up or make it faster? I am currently writing a memoization function/table as a way of speeding this up, but can anyone else think of a better way to improve the speed of this? Maybe a different structure or something else I'm missing.

非常感谢提前,

Nathan

推荐答案

你想要的是一个BKTree(BK-Tree)结合Levenshtein Distance算法。 BKtree中的查找性能取决于您的搜索模糊。其中模糊被定义为搜索词和匹配之间的距离(编辑)的数量。

What your looking for is a BKTree (BK-Tree) combined with the Levenshtein Distance algorithm. The lookup performance in a BKtree depends on how "Fuzzy" your search is. Where fuzzy is defined as the number of distance (edits) between the search word and the matches.

这是一个关于这个主题的好博客:
http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

Here is a good blog on the subject: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

有关性能的一些注释:
http://www.kafsemo.org/2010/08/03_bk-tree-performance-notes.html

Some notes on the performance: http://www.kafsemo.org/2010/08/03_bk-tree-performance-notes.html

有关 http://en.wikipedia.org/wiki/Levenshtein_distance 算法的注释。

此外,这里是一个用Java编写的BK树。应该提供以下界面的想法: http://code.google.com/p / java-bk-tree /

Also, here is a BK-Tree written in Java. Should give you an idea of the interface: http://code.google.com/p/java-bk-tree/

这篇关于提高模糊字符串匹配字典的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆