优化Levenshtein距离的速度 [英] Optimize speed of Levenshtein distance of many words

查看:135
本文介绍了优化Levenshtein距离的速度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个单元格字典,其中包含很多单词(大约15000个).

I have a cell array dictionary which contains a lot of words (ca. 15000).

我想为所有单词对计算函数strdist(以计算Levenshtein距离).我尝试了两种方法,但是它们都非常慢.什么是更有效的解决方案?

I want to compute the function strdist (to calculate the Levenshtein distance) for all the couples of words. I tried in two ways, but they are both really slow. What can be a more efficient solution?

这是我的代码(dict_keys是我的长度为m的单元格数组):

Here is my code (dict_keys is my cell array of length m):

1)

matrix = sparse(m,m);
for i = 1:m-1;
    matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys );
end

2)

matrix = sparse(m,m);
for i = 1:m-1;
  for j = i+1:m
     matrix(i,j) = strdist(dict_keys{i},dict_keys{j});
  end   
end

推荐答案

函数'strdist'不是内置的matlab函数,所以我想如果您从

Function 'strdist' is not an inbuilt matlab function, so I guess you took if from the File Exchange. That's also why both your approaches are roughly equal in time, cellfun internally just expands into a loop.

如果strdist是对称的,即strdist(a,b)== strdist(b,a),则可以节省一半的计算量.似乎是这种情况,因此仅在第二个循环(您正在执行的操作)中计算j<i的所有情况.

If strdist is symmetric, i.e. strdist(a,b)==strdist(b,a) you can however save half the computations. This seems to be the case, so only calculate all cases of j<i in the second loop (which you are doing).

否则,您可以在C中将strdist作为mex函数实现,并且可能会看到一些显着的速度改进.可以在 rosettacode.org 中找到Levenshtein距离的C实现.

Otherwise you could implement strdist in C as a mex function and probably see some significant speed improvements. A C implementation of the Levenshtein distance can be found for example at rosettacode.org.

或深入研究该算法如何计算两个字符串的距离的详细信息,看看是否可以对其向量化并减少二次方的运行时间.但是,这可能不是很容易.

Or dig into the details of how the algorithm computes the distance of two strings and see if you can vectorize it and reduce the runtime from quadratic so less. This however is probably not very easy.

最后,如果您拥有并行计算工具箱的许可和多核CPU,则由于strdist调用是完全相互独立的,因此可以轻松地并行化代码.

Finally if you have the Parallel Computing Toolbox licensed and a multicore CPU you can easily parallelize your code since the strdist calls are completely independent of each other.

这篇关于优化Levenshtein距离的速度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆