Finding closest neighbour using optimized Levenshtein Algorithm
Problem description
I recently posted a question (http://stackoverflow.com/questions/3183149/most-efficient-way-to-calculate-levenshtein-distance) about optimizing the algorithm to compute the Levenshtein distance, and the replies led me to the Wikipedia article on Levenshtein distance.
The article mentioned that if there is a bound k on the maximum distance a possible result can be from the given query, then the running time can be reduced from O(mn) to O(kn), m and n being the lengths of the strings. I looked up the algorithm, but I couldn't really figure out how to implement it. I was hoping to get some leads on that here.
The optimization is #4 under "Possible Improvements".
The part that confused me was the one that said that we only need to compute a diagonal stripe of width 2k+1, centered on the main diagonal (the main diagonal is defined as coordinates (i,i)).
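One way to read the stripe remark, as a minimal C++ sketch rather than the book's or Wikipedia's exact algorithm: if the final distance is at most k, the optimal alignment can never stray more than k cells from the main diagonal, so cells with |i - j| > k can be treated as "too expensive" and skipped. The function name `bounded_levenshtein` and the convention of returning -1 when the distance exceeds k are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: returns the Levenshtein distance between a and b if it
// is at most k, and -1 otherwise. Only cells with |i - j| <= k are computed,
// so each row costs O(k) work instead of O(n).
int bounded_levenshtein(const std::string& a, const std::string& b, int k) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    if (k < 0 || std::abs(m - n) > k) return -1;  // length gap alone exceeds k

    const int INF = k + 1;  // any value above k means "outside the budget"
    std::vector<int> prev(n + 1, INF), cur(n + 1, INF);

    // Row 0: turning the empty prefix of a into b[0..j) costs j insertions.
    for (int j = 0; j <= std::min(n, k); ++j) prev[j] = j;

    for (int i = 1; i <= m; ++i) {
        const int lo = std::max(1, i - k);  // left edge of the stripe
        const int hi = std::min(n, i + k);  // right edge of the stripe
        // The cell just left of the stripe: column 0 costs i deletions,
        // anything else is outside the stripe and counts as INF.
        cur[lo - 1] = (lo == 1) ? i : INF;
        for (int j = lo; j <= hi; ++j) {
            const int subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            const int del = prev[j] + 1;     // drop a[i-1]
            const int ins = cur[j - 1] + 1;  // insert b[j-1]
            cur[j] = std::min({subst, del, ins, INF});  // cap at INF
        }
        std::swap(prev, cur);
    }
    return prev[n] <= k ? prev[n] : -1;
}
```

For example, `bounded_levenshtein("kitten", "sitting", 3)` yields 3, while the same call with k = 2 yields -1 because the true distance does not fit in the budget. Capping every cell at k + 1 is a design choice that keeps cells outside the stripe from ever looking cheap.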
If someone could offer some help/insight, I would really appreciate it. If needed, I can post the complete description of the algorithm in the book as an answer here.
Recommended answer
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.
    #include <cstring>  // strlen

    const char* a = /* string 1 */;
    const char* b = /* string 2 */;
    int na = strlen(a);
    int nb = strlen(b);

    bool walk(int ia, int ib, int k){
        /* if the budget is exhausted, prune the search */
        if (k < 0) return false;
        /* if at end of both strings we have a match */
        if (ia == na && ib == nb) return true;
        /* if the current characters match, continue walking with no reduction in budget */
        if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
        /* if the current characters don't match, assume there is a 1-character replacement */
        if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
        /* try assuming there is an extra character in a */
        if (ia < na && walk(ia+1, ib, k-1)) return true;
        /* try assuming there is an extra character in b */
        if (ib < nb && walk(ia, ib+1, k-1)) return true;
        /* if none of those worked, I give up */
        return false;
    }
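The "first k=0, then k=1, then k=2" loop described above can be sketched as follows. This is a hedged illustration, not the answerer's own driver: `within_budget` is a hypothetical variant of `walk` that takes the strings as parameters instead of globals, and `smallest_budget` and its `max_k` cutoff are names invented here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical self-contained variant of walk(): same branching as the routine
// above, with the two strings passed explicitly.
bool within_budget(const std::string& a, const std::string& b,
                   std::size_t ia, std::size_t ib, int k) {
    if (k < 0) return false;                            // budget exhausted
    if (ia == a.size() && ib == b.size()) return true;  // both strings consumed
    if (ia < a.size() && ib < b.size()) {
        // A match is free, a replacement costs one unit of budget.
        const int cost = (a[ia] == b[ib]) ? 0 : 1;
        if (within_budget(a, b, ia + 1, ib + 1, k - cost)) return true;
    }
    // Extra character in a, then extra character in b.
    if (ia < a.size() && within_budget(a, b, ia + 1, ib, k - 1)) return true;
    if (ib < b.size() && within_budget(a, b, ia, ib + 1, k - 1)) return true;
    return false;
}

// Tries k = 0, 1, 2, ... and returns the first budget that succeeds,
// or -1 if nothing up to max_k (an assumed cutoff) works.
int smallest_budget(const std::string& a, const std::string& b, int max_k) {
    for (int k = 0; k <= max_k; ++k) {
        if (within_budget(a, b, 0, 0, k)) return k;
    }
    return -1;
}
```

Because the budgets are tried in increasing order, the first k that succeeds is the edit distance itself. The walk is exponential in k, so this only makes sense for the small budgets the answer has in mind.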
Added to explain trie-search:
    #include <cstdio>  // printf

    // definition of trie-node:
    struct TNode {
        TNode* pa[128]; // for each possible character, pointer to subnode
    };

    // simple trie-walk of a node
    // key is the input word, answer is the output word,
    // i is the character position, and hdis is the hamming distance.
    void walk(TNode* p, char key[], char answer[], int i, int hdis){
        // If this is the end of a word in the trie, it is marked as
        // having something non-null under the '\0' entry of the trie.
        if (p->pa[0] != nullptr){
            if (key[i] == '\0') printf("answer = %s, hdis = %d\n", answer, hdis);
        }
        // for every actual subnode of the trie
        // (int rather than char: a signed char can never reach 128, so the loop would not end)
        for(int c = 1; c < 128; c++){
            // if it is a real subnode
            if (p->pa[c] != nullptr){
                // keep track of the answer word represented by the trie
                answer[i] = (char)c; answer[i+1] = '\0';
                // and walk that subnode
                // If the answer disagrees with the key, increment the hamming distance
                walk(p->pa[c], key, answer, i+1, (answer[i]==key[i] ? hdis : hdis+1));
            }
        }
    }
    // Note: you have to edit this to handle short keys.
    // Simplest is to just append a lot of '\0' bytes to the key.
Now, to limit it to a budget, just refuse to descend if hdis is too large.
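A minimal sketch of that budgeted descent, under stated assumptions: the `Node`, `insert`, `search`, and `near_matches` names are hypothetical, keys are assumed to be 7-bit ASCII, and end-of-word is marked with a flag instead of the '\0' entry used above. Unlike the walk above, this version stops at the end of the key, so only words of the same length as the key are reported and short keys need no padding.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical trie node: one child slot per 7-bit ASCII character.
struct Node {
    std::unique_ptr<Node> child[128];
    bool is_word = false;  // true if a stored word ends at this node
};

void insert(Node& root, const std::string& word) {
    Node* p = &root;
    for (unsigned char c : word) {
        if (c >= 128) return;  // assumption: non-ASCII words are ignored
        if (!p->child[c]) p->child[c] = std::make_unique<Node>();
        p = p->child[c].get();
    }
    p->is_word = true;
}

// Depth-first walk that refuses to descend once hdis exceeds the budget.
void search(const Node& p, const std::string& key, std::size_t i, int hdis,
            int budget, std::string& answer, std::vector<std::string>& hits) {
    if (hdis > budget) return;  // prune: too many mismatches already
    if (i == key.size()) {      // key consumed: report a word ending here
        if (p.is_word) hits.push_back(answer);
        return;
    }
    for (int c = 1; c < 128; ++c) {
        if (!p.child[c]) continue;
        const char ch = static_cast<char>(c);
        answer.push_back(ch);
        search(*p.child[c], key, i + 1, hdis + (key[i] == ch ? 0 : 1),
               budget, answer, hits);
        answer.pop_back();
    }
}

// Collects every stored word within Hamming distance `budget` of key.
std::vector<std::string> near_matches(const Node& root, const std::string& key,
                                      int budget) {
    std::vector<std::string> hits;
    std::string answer;
    search(root, key, 0, 0, budget, answer, hits);
    return hits;
}
```

With "cat", "car", "cot", "dog", and "cart" inserted, `near_matches(root, "cat", 1)` collects "car", "cat", and "cot" in that order, since children are visited in ascending character order. Combined with the k=0, 1, 2 schedule from the first routine, the budget can be raised until the first hit appears.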