高度可并行化的Levenstein距离算法 [英] Highly parallelizable Levenstein Distance Algorithm

查看:44
本文介绍了高度可并行化的Levenstein距离算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我实际上必须实现一个字符串比较,在结尾处我得到一个匹配的百分比(不仅仅是布尔结果匹配/不匹配).因此,为此,我找到了Levenstein Distance算法.但是,现在的问题是性能.例如,我有1k个字符串可以相互比较,现在大约需要10分钟.对于每个我已经并行调用的算法,并在每个并行中再次调用.所以我得到了伪语言:

I actually have to implement a string comparaison where I get a matching percentage at the end (not just boolean result match/unmatch). So, to do that I've found the Levenstein Distance algorithm. But the issue is now on performance. For instance, I have 1k strings to compare to each other, it takes about 10 minutes now. For each I already call the algo in parallel and again within each it's done in parallel. So i got in pseudo language :

Foreach strings
    Call in parallel the comparaison method.

比较方法之内

Foreach stringsToCompare
    Call in parallel the Levenstein Distance algo.

在i5 @ 2.6Ghz上100%CPU使用率下仍然需要10分钟...

And still 10minutes at 100% CPU usage on a i5 @ 2.6Ghz...

这是我的实现方式

public static double GetSimilarity(string firstString, string secondString)
{
    if (ReferenceEquals(firstString, null)) throw new ArgumentNullException("firstString");
    if (ReferenceEquals(secondString, null)) throw new ArgumentNullException("secondString");
    if (firstString == secondString) return 100;

    return (1 - GetLevensteinDistance(firstString, secondString) / (double)Math.Max(firstString.Length, secondString.Length)) * 100;
}

private static int GetLevensteinDistance(string firstString, string secondString)
{
    if (ReferenceEquals(firstString, null)) throw new ArgumentNullException("firstString");
    if (ReferenceEquals(secondString, null)) throw new ArgumentNullException("secondString");
    if (firstString == secondString) return 1;

    int[,] matrix = new int[firstString.Length + 1, secondString.Length + 1];

    for (int i = 0; i <= firstString.Length; i++)
        matrix[i, 0] = i; // deletion
    for (int j = 0; j <= secondString.Length; j++)
        matrix[0, j] = j; // insertion

    for (int i = 0; i < firstString.Length; i++)
        for (int j = 0; j < secondString.Length; j++)
            if (firstString[i] == secondString[j])
                matrix[i + 1, j + 1] = matrix[i, j];
            else
            {
                matrix[i + 1, j + 1] = Math.Min(matrix[i, j + 1] + 1, matrix[i + 1, j] + 1); //deletion or insertion
                matrix[i + 1, j + 1] = Math.Min(matrix[i + 1, j + 1], matrix[i, j] + 1); //substitution
            }
    return matrix[firstString.Length, secondString.Length];
}

那么您知道类似的算法也许更适合于长文本比较或高度可并行化吗?

So do you know a similar algo which is perhaps more appropriate for long text comparison or highly parallelizable ?

推荐答案

我发现了一个名为SimMetrics的supa dupa库,其中包含许多相似算法的实现,当我对其进行基准测试时,在我的案例中有一些很棒的方法快得多.

I've found a supa dupa library called SimMetrics which contains a lot of implementation of similarity algorithm and when I benchmark them there are some great and usefull in my case much more faster.

如果您也有兴趣: http://sourceforge.net/projects/simmetrics/

这篇关于高度可并行化的Levenstein距离算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆