优化Levenshtein距离的速度 [英] Optimize speed of Levenshtein distance of many words
问题描述
我有一个单元格字典,其中包含很多单词(大约15000个).
I have a cell array dictionary which contains a lot of words (ca. 15000).
我想为所有单词对计算函数strdist
(以计算Levenshtein距离).我尝试了两种方法,但是它们都非常慢.什么是更有效的解决方案?
I want to compute the function strdist
(to calculate the Levenshtein distance) for all the couples of words. I tried in two ways, but they are both really slow. What can be a more efficient solution?
这是我的代码(dict_keys是我的长度为m的单元格数组):
Here is my code (dict_keys is my cell array of length m):
1)
matrix = sparse(m,m);
for i = 1:m-1;
matrix(i,:) = cellfun( @(u) strdist(dict_keys{i},u), dict_keys );
end
2)
matrix = sparse(m,m);
for i = 1:m-1;
for j = i+1:m
matrix(i,j) = strdist(dict_keys{i},dict_keys{j});
end
end
推荐答案
函数'strdist'不是内置的matlab函数,所以我想如果您从
Function 'strdist' is not an inbuilt matlab function, so I guess you took if from the File Exchange. That's also why both your approaches are roughly equal in time, cellfun
internally just expands into a loop.
如果strdist
是对称的,即strdist(a,b)== strdist(b,a),则可以节省一半的计算量.似乎是这种情况,因此仅在第二个循环(您正在执行的操作)中计算j<i
的所有情况.
If strdist
is symmetric, i.e. strdist(a,b)==strdist(b,a) you can however save half the computations. This seems to be the case, so only calculate all cases of j<i
in the second loop (which you are doing).
否则,您可以在C中将strdist作为mex函数实现,并且可能会看到一些显着的速度改进.可以在 rosettacode.org 中找到Levenshtein距离的C实现.
Otherwise you could implement strdist in C as a mex function and probably see some significant speed improvements. A C implementation of the Levenshtein distance can be found for example at rosettacode.org.
或深入研究该算法如何计算两个字符串的距离的详细信息,看看是否可以对其向量化并减少二次方的运行时间.但是,这可能不是很容易.
Or dig into the details of how the algorithm computes the distance of two strings and see if you can vectorize it and reduce the runtime from quadratic so less. This however is probably not very easy.
最后,如果您拥有并行计算工具箱的许可和多核CPU,则由于strdist
调用是完全相互独立的,因此可以轻松地并行化代码.
Finally if you have the Parallel Computing Toolbox licensed and a multicore CPU you can easily parallelize your code since the strdist
calls are completely independent of each other.
这篇关于优化Levenshtein距离的速度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!