更好的模糊匹配性能? [英] Better fuzzy matching performance?

查看:133
本文介绍了更好的模糊匹配性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在使用 difflib 中的方法get_close_matches方法进行迭代15,000个字符串的列表,以与大约15,000个字符串的另一个列表最接近:

I'm currently using method get_close_matches method from difflib to iterate through a list of 15,000 strings to get the closest match against another list of approx 15,000 strings:

a=['blah','pie','apple'...]
b=['jimbo','zomg','pie'...]

for value in a:
    difflib.get_close_matches(value,b,n=1,cutoff=.85)

每个值花费.58秒,这意味着完成循环将花费8,714秒或145分钟.是否有另一种库/方法可能更快或更有效地提高了此方法的速度?我已经尝试过将两个数组都转换为小写,但这只会导致速度略有提高.

It takes .58 seconds per value which means it will take 8,714 seconds or 145 minutes to finish the loop. Is there another library/method that might be faster or a way to improve the speed for this method? I've already tried converting both arrays to lower case, but it only resulted in a slight speed increase.

推荐答案

fuzzyset 按字符串的二元组和三字母组索引,以便在difflib O(log(N)) O(N)中找到近似的匹配项.对于我的1M +个单词和单词对的模糊集,它可以在大约20秒内计算索引,并在不到100毫秒的时间内找到最接近的匹配项.

fuzzyset indexes strings by their bigrams and trigrams so it finds approximate matches in O(log(N)) vs O(N) for difflib. For my fuzzyset of 1M+ words and word-pairs it can compute the index in about 20 seconds and find the closest match in less than a 100 ms.

这篇关于更好的模糊匹配性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆