Most efficient way to calculate Levenshtein distance

Problem description

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.

Here's my implementation:

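 public int calculate(String root, String query)
 {
  // Row/column 0 hold the characters, row/column 1 hold the base-case distances;
  // arr[1][1] stays 0 (distance between two empty prefixes).
  int[][] arr = new int[root.length() + 2][query.length() + 2];

  // Column 0: characters of root. Column 1: cost of deleting the first i-1 of them.
  for (int i = 2; i < root.length() + 2; i++)
  {
   arr[i][0] = root.charAt(i - 2);
   arr[i][1] = i - 1;
  }

  // Row 0: characters of query. Row 1: cost of inserting the first i-1 of them.
  for (int i = 2; i < query.length() + 2; i++)
  {
   arr[0][i] = query.charAt(i - 2);
   arr[1][i] = i - 1;
  }

  // Fill the table: each cell is the cheapest of deletion, insertion and substitution.
  for (int i = 2; i < root.length() + 2; i++)
   for (int j = 2; j < query.length() + 2; j++)
   {
    int diff = 0;
    if (arr[0][j] != arr[i][0])
     diff = 1;
    arr[i][j] = min(arr[i - 1][j] + 1, arr[i][j - 1] + 1, arr[i - 1][j - 1] + diff);
   }

  return arr[root.length() + 1][query.length() + 1];
 }

 // Smallest of three ints.
 public int min(int n1, int n2, int n3)
 {
  return Math.min(n1, Math.min(n2, n3));
 }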

Solution

The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case: if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n times k) instead of O(n squared), basically by giving up as soon as the minimum possible distance becomes > k.
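
As a rough illustration of the bounded computation, here is a minimal Java sketch. It is not taken from the Wikipedia entry or from the question's code: the class name BoundedLevenshtein, the method name distance, and the convention of returning -1 when the distance exceeds k are all assumptions made for this example. It keeps only two rows and fills only the cells within k of the diagonal, because a cell farther away than that already costs more than k.

```java
final class BoundedLevenshtein {
    private BoundedLevenshtein() {}

    /**
     * Hypothetical helper: returns the Levenshtein distance between a and b
     * if it is at most k, otherwise -1.
     */
    static int distance(String a, String b, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        int n = a.length();
        int m = b.length();
        // The length difference is a lower bound on the distance.
        if (Math.abs(n - m) > k) {
            return -1;
        }

        final int INF = Integer.MAX_VALUE / 2; // "infinity" that survives + 1
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];

        // Row 0: turning the empty prefix of a into the first j chars of b.
        for (int j = 0; j <= m; j++) {
            prev[j] = (j <= k) ? j : INF;
        }

        for (int i = 1; i <= n; i++) {
            // Only columns within k of the diagonal can hold a value <= k.
            int lo = Math.max(1, i - k);
            int hi = Math.min(m, i + k);

            // Left neighbour of the band: real value in column 0, else infinity.
            curr[lo - 1] = (lo == 1) ? i : INF;
            int rowMin = curr[lo - 1];

            for (int j = lo; j <= hi; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int value = Math.min(prev[j - 1] + cost,          // substitute or match
                            Math.min(prev[j] + 1,                 // delete from a
                                     curr[j - 1] + 1));           // insert into a
                curr[j] = value;
                rowMin = Math.min(rowMin, value);
            }
            // Right neighbour of the band, read by the next row.
            if (hi < m) {
                curr[hi + 1] = INF;
            }
            // Every cell in this row exceeds k, so the final distance must too.
            if (rowMin > k) {
                return -1;
            }

            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return (prev[m] <= k) ? prev[m] : -1;
    }
}
```

Under these assumptions, BoundedLevenshtein.distance("kitten", "sitting", 3) yields 3, while the same call with k = 2 yields -1 because the true distance is above the bound.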

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far. This won't affect the worst-case behavior (the matches might arrive in decreasing order of distance, meaning you'll never bail out any sooner), but the average case should improve.
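
The shrinking bound could look roughly like the following Java sketch. Everything named here is hypothetical: the ClosestMatch class, the find method, and the BoundedDistance interface, whose assumed contract is "return the distance if it is at most k, otherwise -1". Any bounded Levenshtein routine that honors that contract could be plugged in.

```java
import java.util.List;

final class ClosestMatch {
    /** Assumed contract: the distance if it is <= k, otherwise -1. */
    interface BoundedDistance {
        int distance(String a, String b, int k);
    }

    private ClosestMatch() {}

    /** Returns the candidate closest to query, or null if there are no candidates. */
    static String find(String query, List<String> candidates, BoundedDistance metric) {
        String best = null;
        int bestDistance = -1; // meaningful only once best is non-null

        for (String candidate : candidates) {
            // Before the first hit, the longer length is a safe upper bound.
            // Afterwards, only strictly better matches are interesting.
            int k = (best == null)
                    ? Math.max(query.length(), candidate.length())
                    : bestDistance - 1;

            int d = metric.distance(query, candidate, k);
            if (d >= 0) {
                best = candidate;
                bestDistance = d;
                if (d == 0) {
                    break; // exact match, nothing can beat it
                }
            }
        }
        return best;
    }
}
```

Using bestDistance - 1 as the bound keeps the first of several equally close candidates. Passing bestDistance instead would also accept ties, at the cost of slightly less pruning.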

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).
