优化Levenshtein距离算法 [英] Optimizing Levenshtein distance algorithm

查看:345
本文介绍了优化Levenshtein距离算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个存储过程,该过程使用Levenshtein距离来确定最接近用户键入内容的结果.唯一真正影响速度的是在选择距离最小的记录之前计算所有记录的Levenshtein距离的函数(我已经通过将0代替对Levenshtein函数的调用来验证了这一点).该表有150万条记录,因此即使稍作调整也可以节省几秒钟的时间.现在,整个过程耗时超过10分钟.这是我使用的方法:

ALTER function dbo.Levenshtein
( 
    @Source nvarchar(200), 
    @Target nvarchar(200) 
) 
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)

SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0

WHILE @j <= @Target_len
BEGIN
    SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END

WHILE @i <= @Source_len
BEGIN
    SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1

WHILE @j <= @Target_len
BEGIN
    SET @Dist = @Dist + 1
    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                  CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END

    IF @Dist > @Dist_temp
    BEGIN
        SET @Dist = @Dist_temp
    END

    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1

    IF @Dist > @Dist_temp SET @Dist = @Dist_temp
    BEGIN
        SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
    END
END

SELECT @Distv1 = @Distv0, @i = @i + 1
END

RETURN @Dist
END

我应该从这里去哪里?

解决方案

我过去这样做的方法是将数据库"(实际上是拼写校正工具的单词词典)存储为trie. /p>

然后,我使用分支定界例程查找最近的匹配条目.对于小距离,花费的时间与距离成指数关系.对于大距离来说,字典的大小是线性的,就像您现在所看到的一样.

分支绑定本质上是trie的深度优先树遍历,但是有错误预算.在每个节点上,您都会跟踪当前的levenshtein距离,如果超出了预算,则可以修剪该树的该分支.

首先,您的预算为零.那只会找到完全匹配.如果找不到匹配项,则预算为1.它将找到相距1的匹配项.如果找不到,则以2的预算进行匹配,依此类推.这听起来效率不高,但是由于每次步行都比前一次步行花费更多时间,因此时间取决于您进行的最后一次步行.

添加了:代码概要(请原谅我的C语言):

// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
  tnodeTag* p[128];
} tnode;

tnode* top; // the top of the trie

void walk(tnode* p, char* s, int budget){
  int i;
  if (*s == 0){
    if (p == NULL){
      // print the current trie path
    }
  }
  else if (budget >= 0){
    // try deleting this letter
    walk(p, s+1, budget-1);
    // try swapping two adjacent letters
    if (s[1]){
      swap(s[0], s[1]);
      walk(p, s, budget-1);
      swap(s[0], s[1]);
    }
    if (p){
      for (i = 0; i < 128; i++){
        // try exact match
        if (i == *s) walk(p->p[i], s+1, budget);
        // try replacing this character
        if (i != *s) walk(p->p[i], s+1, budget-1);
        // try inserting this letter
        walk(p->p[i], s, budget-1);
      }
    }
  }
}

基本上,您可以通过跳过字母并在同一节点上进行搜索来模拟删除字母.您可以模拟通过降低Trie而不插入s来插入字母.您可以模拟一个字母来替换一个字母,即使该字母不匹配也是如此.一经发现,您就可以添加其他可能的不匹配项,例如将0替换为O,将1替换为L或I-像这样的愚蠢的东西.

您可能想添加一个字符数组参数来表示您在trie中找到的当前单词.

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:

ALTER function dbo.Levenshtein
( 
    @Source nvarchar(200), 
    @Target nvarchar(200) 
) 
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)

SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0

WHILE @j <= @Target_len
BEGIN
    SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END

WHILE @i <= @Source_len
BEGIN
    SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1

WHILE @j <= @Target_len
BEGIN
    SET @Dist = @Dist + 1
    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                  CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END

    IF @Dist > @Dist_temp
    BEGIN
        SET @Dist = @Dist_temp
    END

    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1

    IF @Dist > @Dist_temp SET @Dist = @Dist_temp
    BEGIN
        SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
    END
END

SELECT @Distv1 = @Distv0, @i = @i + 1
END

RETURN @Dist
END

Where should I go from here?

解决方案

The way I've done this in the past is to store the "database" (actually a dictionary of words for a spelling correcter) as a trie.

Then I used a branch-and-bound routine to look up nearest matching entries. For small distances, the time it takes is exponential in the distance. For large distances, it is linear in the size of the dictionary, just as you are seeing now.

Branch-and-bound is basically a depth-first tree walk of the trie, but with an error budget. At each node, you keep track of the current levenshtein distance, and if it exceeds the budget, you prune that branch of the tree.

First you do the walk with a budget of zero. That will only find exact matches. If you don't find a match, then you walk it with a budget of one. That will find matches at a distance of 1. If you don't find any, then you do it with a budget of 2, and so on. This sounds inefficient, but since each walk takes so much more time than the previous one, the time is dominated by the last walk that you make.

Added: outline of code (pardon my C):

// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
  tnodeTag* p[128];
} tnode;

tnode* top; // the top of the trie

void walk(tnode* p, char* s, int budget){
  int i;
  if (*s == 0){
    if (p == NULL){
      // print the current trie path
    }
  }
  else if (budget >= 0){
    // try deleting this letter
    walk(p, s+1, budget-1);
    // try swapping two adjacent letters
    if (s[1]){
      swap(s[0], s[1]);
      walk(p, s, budget-1);
      swap(s[0], s[1]);
    }
    if (p){
      for (i = 0; i < 128; i++){
        // try exact match
        if (i == *s) walk(p->p[i], s+1, budget);
        // try replacing this character
        if (i != *s) walk(p->p[i], s+1, budget-1);
        // try inserting this letter
        walk(p->p[i], s, budget-1);
      }
    }
  }
}

Basically, you simulate deleting a letter by skipping it and searching at the same node. You simulate inserting a letter by descending the trie without advancing s. You simulate replacing a letter by acting as if the letter matched, even though it doesn't. When you get the hang of it, you can add other possible mismatches, like replacing 0 with O and 1 with L or I - dumb stuff like that.

You probably want to add a character array argument to represent the current word you are finding in the trie.

这篇关于优化Levenshtein距离算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆