Finding closest neighbour using optimized Levenshtein Algorithm


Question

I recently posted a question about optimizing the algorithm to compute the Levenshtein Distance, and the replies led me to the Wikipedia article on Levenshtein Distance.

The article mentions that if there is a bound k on the maximum distance a possible result can be from the given query, then the running time can be reduced from O(mn) to O(kn), where m and n are the lengths of the strings. I looked up the algorithm, but I couldn't really figure out how to implement it. I was hoping to get some leads on that here.

The optimization is #4 under "Possible Improvements".

The part that confused me was the one that said we only need to compute a diagonal stripe of width 2k+1, centered on the main diagonal (the main diagonal is defined as the coordinates (i,i)).
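For reference, here is a minimal sketch of what computing only that stripe might look like. It assumes the usual two-row dynamic-programming formulation; the function name `boundedLevenshtein` and the convention of returning -1 when the distance exceeds k are illustrative choices, not something taken from the article.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: returns the Levenshtein distance between s and t
// if it is at most k, and -1 otherwise. Only cells with |i - j| <= k
// (the stripe of width 2k+1 around the main diagonal) are computed.
int boundedLevenshtein(const std::string& s, const std::string& t, int k) {
    const int m = static_cast<int>(s.size());
    const int n = static_cast<int>(t.size());
    // If the lengths differ by more than k, the distance must exceed k.
    if (k < 0 || std::abs(m - n) > k) return -1;

    const int INF = k + 1;  // any value above k means "over budget"
    std::vector<int> prev(n + 1, INF), cur(n + 1, INF);
    for (int j = 0; j <= std::min(n, k); ++j) prev[j] = j;  // row 0 inside the stripe

    for (int i = 1; i <= m; ++i) {
        const int lo = std::max(1, i - k);  // first column of the stripe in row i
        const int hi = std::min(n, i + k);  // last column of the stripe in row i
        // Cell just left of the stripe: real value in column 0 while i <= k, else blocked.
        cur[lo - 1] = (lo == 1 && i <= k) ? i : INF;
        for (int j = lo; j <= hi; ++j) {
            const int subst = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            const int del = prev[j] + 1;     // drop s[i-1]
            const int ins = cur[j - 1] + 1;  // insert t[j-1]
            cur[j] = std::min({subst, del, ins, INF});  // cap so values stay small
        }
        // Cell just right of the stripe is read by the next row, so block it.
        if (hi < n) cur[hi + 1] = INF;
        std::swap(prev, cur);
    }
    return prev[n] <= k ? prev[n] : -1;
}
```

Each row touches at most 2k+1 cells plus one sentinel on either side. Any alignment costing at most k can never stray more than k cells from the main diagonal, which is why the cells outside the stripe can be treated as over budget.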

If someone could offer some help/insight, I would really appreciate it. If needed, I can post the complete description of the algorithm in the book as an answer here.

Recommended answer

I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, which I use to prune the tree. With that routine in hand, I first run it with k=0, then k=1, then k=2, until I either get a hit or I don't want to go any higher.

#include <string.h> /* strlen */

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}
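
The "run it with k=0, then k=1, then k=2" loop can be sketched as below. This is only an illustration: `walkWithin` mirrors the routine above but takes the strings as parameters instead of globals, and both it and `distanceWithin` (with its `maxK` cutoff) are hypothetical names introduced here.

```cpp
#include <cstddef>
#include <string>

// Same budgeted depth-first walk as above, with the strings passed explicitly.
// Returns true if a[ia..] can be turned into b[ib..] using at most k edits.
static bool walkWithin(const std::string& a, const std::string& b,
                       std::size_t ia, std::size_t ib, int k) {
    if (k < 0) return false;                             // budget exhausted: prune
    if (ia == a.size() && ib == b.size()) return true;   // both strings consumed
    if (ia < a.size() && ib < b.size()) {
        // matching characters cost nothing, a replacement costs one edit
        const int cost = (a[ia] == b[ib]) ? 0 : 1;
        if (walkWithin(a, b, ia + 1, ib + 1, k - cost)) return true;
    }
    // extra character in a
    if (ia < a.size() && walkWithin(a, b, ia + 1, ib, k - 1)) return true;
    // extra character in b
    if (ib < b.size() && walkWithin(a, b, ia, ib + 1, k - 1)) return true;
    return false;
}

// Tries budgets 0, 1, 2, ... up to maxK and returns the first one that
// succeeds, which is the edit distance; returns -1 if none succeeds.
int distanceWithin(const std::string& a, const std::string& b, int maxK) {
    for (int k = 0; k <= maxK; ++k) {
        if (walkWithin(a, b, 0, 0, k)) return k;
    }
    return -1;
}
```

Because the budget grows one step at a time, the first k that yields a hit is the smallest one, so the cheap low-budget passes usually end the search early when the strings are close.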

Added to explain the trie search:

#include <stdio.h> // printf

// definition of trie-node:
struct TNode {
  TNode* pa[128]; // for each possible character, pointer to subnode
};

// simple trie-walk of a node
// key is the input word, answer is the output word,
// i is the character position, and hdis is the hamming distance.
void walk(TNode* p, char key[], char answer[], int i, int hdis){
  // If this is the end of a word in the trie, it is marked as
  // having something non-null under the '\0' entry of the trie.
  if (p->pa[0] != NULL){
    if (key[i] == '\0') printf("answer = %s, hdis = %d\n", answer, hdis);
  }
  // for every actual subnode of the trie
  for(int c = 1; c < 128; c++){ // int, not char: a signed char can never reach 128
    // if it is a real subnode
    if (p->pa[c] != NULL){
      // keep track of the answer word represented by the trie
      answer[i] = (char)c; answer[i+1] = '\0';
      // and walk that subnode
      // If the answer disagrees with the key, increment the hamming distance
      walk(p->pa[c], key, answer, i+1, (answer[i]==key[i] ? hdis : hdis+1));
    }
  }
}
// Note: you have to edit this to handle short keys.
// Simplest is to just append a lot of '\0' bytes to the key.

Now, to limit it to a budget, just refuse to descend if hdis is too large.
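
A minimal sketch of that budgeted descent follows. It makes some assumptions of its own: words are 7-bit ASCII, the end of a word is marked with an `isWord` flag rather than the entry for the terminating byte, and only words of the same length as the key are reported (Hamming distance). `TrieNode`, `insertWord`, and `walkBounded` are hypothetical names, not the ones used above.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical trie node: one child slot per 7-bit ASCII character.
struct TrieNode {
    std::unique_ptr<TrieNode> child[128];
    bool isWord = false;  // assumption: a flag marks word ends
};

// Adds a word to the trie (assumes 7-bit ASCII input).
void insertWord(TrieNode* root, const std::string& word) {
    TrieNode* p = root;
    for (char ch : word) {
        const int c = ch & 0x7F;
        if (!p->child[c]) p->child[c] = std::make_unique<TrieNode>();
        p = p->child[c].get();
    }
    p->isWord = true;
}

// Collects every stored word with the same length as key whose Hamming
// distance to key is at most budget. answer holds the path walked so far.
void walkBounded(const TrieNode* p, const std::string& key, std::string& answer,
                 int hdis, int budget, std::vector<std::string>& out) {
    if (hdis > budget) return;  // over budget: refuse to descend
    const std::size_t i = answer.size();
    if (i == key.size()) {
        if (p->isWord) out.push_back(answer);
        return;  // longer words cannot match an equal-length comparison
    }
    for (int c = 1; c < 128; ++c) {
        const TrieNode* next = p->child[c].get();
        if (!next) continue;
        answer.push_back(static_cast<char>(c));
        // a disagreement with the key costs one unit of Hamming distance
        walkBounded(next, key, answer, hdis + (key[i] == c ? 0 : 1), budget, out);
        answer.pop_back();
    }
}
```

As with the string-to-string walk, the budget can be raised step by step (0, then 1, then 2) until the result list is non-empty.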
