Damerau-Levenshtein距离实现 [英] Damerau-Levenshtein distance Implementation
问题描述
我正在尝试在JS中创建damerau-levenshtein距离函数.
I'm trying to create a damerau-levenshtein distance function in JS.
我在WIkipedia上找到了关于该算法的描述,但没有实现.它说:
I've found a description off the algorithm on WIkipedia, but they is no implementation off it. It says:
设计适当的算法来计算无限制 Damerau–Levenshtein距离请注意,始终存在最优 编辑操作的顺序,从不一次转换的字母 之后修改.因此,我们只需要考虑两种对称方式 多次修改子字符串的方法:(1)转置字母和 在它们之间插入任意数量的字符,或者(2)删除一个 相邻字符和转置字母的序列 删除后.这个想法的简单实现 三次复杂度的算法:O \ left(M \ cdot N \ cdot \ max(M,N) \ right),其中M和N是字符串长度.使用的想法 Lowrance和Wagner,[7]这种朴素的算法可以改进为 在最坏的情况下为O \ left(M \ cdot N \ right).有趣的是 可以修改bitap算法以处理转置.见 [1]的信息检索部分提供了这样的示例 适应.
To devise a proper algorithm to calculate unrestricted Damerau–Levenshtein distance note that there always exists an optimal sequence of edit operations, where once-transposed letters are never modified afterwards. Thus, we need to consider only two symmetric ways of modifying a substring more than once: (1) transpose letters and insert an arbitrary number of characters between them, or (2) delete a sequence of characters and transpose letters that become adjacent after deletion. The straightforward implementation of this idea gives an algorithm of cubic complexity: O\left (M \cdot N \cdot \max(M, N) \right ), where M and N are string lengths. Using the ideas of Lowrance and Wagner,[7] this naive algorithm can be improved to be O\left (M \cdot N \right) in the worst case. It is interesting that the bitap algorithm can be modified to process transposition. See the information retrieval section of[1] for an example of such an adaptation.
https://en.wikipedia.org/wiki/Damerau%E2% 80%93Levenshtein_distance
第[1]节指向 http://acl.ldc .upenn.edu/P/P00/P00-1037.pdf 这对我来说更加复杂.
The section [1] points to http://acl.ldc.upenn.edu/P/P00/P00-1037.pdf which is even more complicated to me.
如果我正确理解了这一点,那么创建一个实现就不那么容易了.
If I understood this correctly, it's not that easy to create an implementation off it.
这是我当前使用的levenshtein实现:
Here's the levenshtein implementation I currently use :
levenshtein=function (s1, s2) {
// discuss at: http://phpjs.org/functions/levenshtein/
// original by: Carlos R. L. Rodrigues (http://www.jsfromhell.com)
// bugfixed by: Onno Marsman
// revised by: Andrea Giammarchi (http://webreflection.blogspot.com)
// reimplemented by: Brett Zamir (http://brett-zamir.me)
// reimplemented by: Alexander M Beedie
// example 1: levenshtein('Kevin van Zonneveld', 'Kevin van Sommeveld');
// returns 1: 3
if (s1 == s2) {
return 0;
}
var s1_len = s1.length;
var s2_len = s2.length;
if (s1_len === 0) {
return s2_len;
}
if (s2_len === 0) {
return s1_len;
}
// BEGIN STATIC
var split = false;
try {
split = !('0')[0];
} catch (e) {
// Earlier IE may not support access by string index
split = true;
}
// END STATIC
if (split) {
s1 = s1.split('');
s2 = s2.split('');
}
var v0 = new Array(s1_len + 1);
var v1 = new Array(s1_len + 1);
var s1_idx = 0,
s2_idx = 0,
cost = 0;
for (s1_idx = 0; s1_idx < s1_len + 1; s1_idx++) {
v0[s1_idx] = s1_idx;
}
var char_s1 = '',
char_s2 = '';
for (s2_idx = 1; s2_idx <= s2_len; s2_idx++) {
v1[0] = s2_idx;
char_s2 = s2[s2_idx - 1];
for (s1_idx = 0; s1_idx < s1_len; s1_idx++) {
char_s1 = s1[s1_idx];
cost = (char_s1 == char_s2) ? 0 : 1;
var m_min = v0[s1_idx + 1] + 1;
var b = v1[s1_idx] + 1;
var c = v0[s1_idx] + cost;
if (b < m_min) {
m_min = b;
}
if (c < m_min) {
m_min = c;
}
v1[s1_idx + 1] = m_min;
}
var v_tmp = v0;
v0 = v1;
v1 = v_tmp;
}
return v0[s1_len];
}
构建这样一种算法的想法是什么,如果您认为它太复杂了,那么我该怎么做才能使"l"(L小写)和"I"(i大写)之间没有区别.
What are your ideas for building such an algorithm and, if you think it would be too complicated, what could I do to make no difference between 'l' (L lowercase) and 'I' (i uppercase) for example.
推荐答案
要点@doukremt给出了: https: //gist.github.com/doukremt/9473228
The gist @doukremt gave: https://gist.github.com/doukremt/9473228
在Javascript中提供了以下内容.
gives the following in Javascript.
您可以在加权对象中更改操作的权重.
You can change the weights of operations in the weighter object.
var levenshteinWeighted= function(seq1,seq2)
{
var len1=seq1.length;
var len2=seq2.length;
var i, j;
var dist;
var ic, dc, rc;
var last, old, column;
var weighter={
insert:function(c) { return 1.; },
delete:function(c) { return 0.5; },
replace:function(c, d) { return 0.3; }
};
/* don't swap the sequences, or this is gonna be painful */
if (len1 == 0 || len2 == 0) {
dist = 0;
while (len1)
dist += weighter.delete(seq1[--len1]);
while (len2)
dist += weighter.insert(seq2[--len2]);
return dist;
}
column = []; // malloc((len2 + 1) * sizeof(double));
//if (!column) return -1;
column[0] = 0;
for (j = 1; j <= len2; ++j)
column[j] = column[j - 1] + weighter.insert(seq2[j - 1]);
for (i = 1; i <= len1; ++i) {
last = column[0]; /* m[i-1][0] */
column[0] += weighter.delete(seq1[i - 1]); /* m[i][0] */
for (j = 1; j <= len2; ++j) {
old = column[j];
if (seq1[i - 1] == seq2[j - 1]) {
column[j] = last; /* m[i-1][j-1] */
} else {
ic = column[j - 1] + weighter.insert(seq2[j - 1]); /* m[i][j-1] */
dc = column[j] + weighter.delete(seq1[i - 1]); /* m[i-1][j] */
rc = last + weighter.replace(seq1[i - 1], seq2[j - 1]); /* m[i-1][j-1] */
column[j] = ic < dc ? ic : (dc < rc ? dc : rc);
}
last = old;
}
}
dist = column[len2];
return dist;
}
这篇关于Damerau-Levenshtein距离实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!