莱文斯坦距离:从矩阵推断编辑操作 [英] Levenshtein Distance: Inferring the edit operations from the matrix

查看:395
本文介绍了莱文斯坦距离:从矩阵推断编辑操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在C写Levenshtein算法++

I wrote Levenshtein algorithm in in C++

如果我输入:
字符串s:民主主义
串T:共和

If I input:
string s: democrat
string t: republican

我得到的矩阵D填充和操作(Levenshtein距离)的数量可以读D [10] [8] = 8
超出填充矩阵欲构造的最优解。如何要看这个解决方案?我没有想法。
请只写我如何一定要认准这个例子。

I get the matrix D filled-up and the number of operations (the Levenshtein distance) can be read in D[10][8] = 8
Beyond the filled matrix I want to construct the optimal solution. How must look this solution? I don't have an idea.
Please only write me HOW MUST LOOK for this example.

推荐答案

的问题是
鉴于Levenshtein算法产生的矩阵,人们怎样才能找到 最优解
也就是说,我们怎样才能找到字符串操作的precise顺序:插入,删除和替换[单号],就要到的字符串'转换为'T字符串

The question is
Given the matrix produced by the Levenshtein algorithm, how can one find "the optimal solution"?
i.e. how can we find the precise sequence of string operations: inserts, deletes and substitution [of a single letter], necessary to convert the 's string' into the 't string'?

首先,应该指出的是,在很多情况下,有几个最优解。照片虽然Levenshtein算法提供的民主党人的的最小数量的操作的(8 /共和例子)有许多序列(8操作),它可以产生这种转化。

First, it should be noted that in many cases there are SEVERAL optimal solutions.
While the Levenshtein algorithm supplies the minimum number of operations (8 in democrat/republican example) there are many sequences (of 8 operations) which can produce this conversion.

通过解码的莱文斯坦矩阵,可以枚举所有这样的最优序列。
总的想法是,最佳的解决方案都遵循的一个路径,从左上角向右下角(或在其他方向),由此,矩阵元值这条道路上或者保持相同或增加一个(或减少由一个在反方向),从0开始,并在操作中的问题(0至8民主主义/共和情况)串的最佳数目结束。的数目增加时的动作是必要的,它保持不变时在相应位置中的串的字母是相同的。

By "decoding" the Levenshtein matrix, one can enumerate ALL such optimal sequences.
The general idea is that the optimal solutions all follow a "path", from top left corner to bottom right corner (or in the other direction), whereby the matrix cell values on this path either remain the same or increase by one (or decrease by one in the reverse direction), starting at 0 and ending at the optimal number of operations for the strings in question (0 thru 8 democrat/republican case). The number increases when an operation is necessary, it stays the same when the letter at corresponding positions in the strings are the same.

有易产生其产生这样的路径的算法(稍微复杂,以产生所有可能的路径),并从这样的路径推断的操作顺序

It is easy to produce an algorithm which produces such a path (slightly more complicated to produce all possible paths), and from such path deduce the sequence of operations.

这寻路算法应该开始在右下角和后向工作的方式。这样做的原因的方法是,我们知道一个事实,即要它必须在此角结束的最优解,并且结束在这个角,它必须有来自3细胞或者立即到其中的一个左,上面刚刚这或立即对角线。通过选择这三个小区中的小区,其中一个满足我们的相同的值,或由一个降低的要求,我们有效地选择一个单元格的最佳途径之一。通过重复操作,直到我们在左上角获得(或实际上,直到我们达到了0值的单元格),我们有效地走回头路的最佳路径我们的方式。

This path finding algorithm should start at the lower right corner and work its way backward. The reason for this approach is that we know for a fact that to be an optimal solution it must end in this corner, and to end in this corner, it must have come from one of the 3 cells either immediately to its left, immediately above it or immediately diagonally. By selecting a cell among these three cells, one which satisfies our "same value or decreasing by one" requirement, we effectively pick a cell on one of the optimal paths. By repeating the operation till we get on upper left corner (or indeed until we reach a cell with a 0 value), we effectively backtrack our way on an optimal path.

还应当指出的是,可以建立所述基体中的两种方法之一:与'民主主义水平或垂直。这不改变Levenshtein距离的计算,也不需要改变操作的列表;它只是改变了方式,我们除preT矩阵,例如水平移动的路径要么是指插入字符[从T串]或删除字符[关闭弦乐]取决于是否字符串s是水平或垂直在基质

It should also be noted that one can build the matrix in one of two ways: with 'democrat' horizontally or vertically. This doesn't change the computation of the Levenshtein distance nor does it change the list of operations needed; it only changes the way we interpret the matrix, for example moving horizontally on the "path" either means inserting a character [from the t string] or deleting a character [off the s string] depending whether 'string s' is "horizontal" or "vertical" in the matrix.

我会使用以下矩阵。因此,惯例是(只打算在左到右和/或顶部至底部方向)

I'll use the following matrix. The conventions are therefore (only going in the left-to-right and/or top-to-bottom directions)

  • 在一个水平移动是 INSERTION 的一封信,从'T字符串
  • 在一个垂直的举动是一个缺失从的字符串'的信
  • 在对角线移动可以是:
    • 一个无操作(字母在各位置都是一样的);数不改变
    • 一个替换(在相应部位的字母是不同的);数增加一个。
    • an horizontal move is an INSERTION of a letter from the 't string'
    • an vertical move is a DELETION of a letter from the 's string'
    • a diagonal move is either:
      • a no-operation (both letters at respective positions are the same); the number doesn't change
      • a SUBSTITUTION (letters at respective positions are distinct); the number increase by one.

      莱文斯坦矩阵S =民主主义,T =共和

            r  e  p  u  b  l  i  c  a  n
         0  1  2  3  4  5  6  7  8  9  10
      d  1  1  2  3  4  5  6  7  8  9  10
      e  2  2  1  2  3  4  5  6  7  8  9
      m  3  3  2  2  3  4  5  6  7  8  9
      o  4  4  3  3  3  4  5  6  7  8  9
      c  5  5  4  4  4  4  5  6  6  7  8
      r  6  5  5  5  5  5  5  6  7  7  8
      a  7  6  6  6  6  6  6  6  7  7  8
      t  8  7  7  7  7  7  7  7  7  8  8
      

      任意的方法,我用它来选择在几个可能的最佳路径之一路径下面是松散的描述:

      The arbitrary approach I use to select one path among several possible optimal paths is loosely described below:

      Starting at the bottom-rightmost cell, and working our way backward toward 
      the top left.
      
      For each "backward" step, consider the 3 cells directly adjacent to the current
      cell   (in the left, top or left+top directions)
      
         if the value in the diagonal cell (going up+left) is smaller or equal to the
            values found in the other two cells
         AND 
            if this is same or 1 minus the value of the current cell 
         then  "take the diagonal cell"
               if the value of the diagonal cell is one less than the current cell:
                  Add a SUBSTITUTION operation (from the letters corresponding to
                  the _current_ cell)
               otherwise: do not add an operation this was a no-operation.
      
         elseif the value in the cell to the right is smaller of equal to the value of
             the of the cell above current cell
         AND 
             if this value is same or 1 minus the value of the current cell 
         then "take the cell to right", and
              add an INSERTION of the letter corresponding to the cell
         else
             take the cell above, add
             Add a DELETION operation of the letter in 's string'
      

      根据这一非正式的伪code,我们得到如下:

      Following this informal pseudo-code, we get the following:

      开始在N,T单元格右下角。
      挑[对角线]一,一个小区作为下一个目的地,因为它是小于其它两个(并满足相同或-1的条件)。
      请注意,新的小区比当前小区
      少一个 因此,在第8步是替代T和N:    democra的 N

      Start on the "n", "t" cell at bottom right.
      Pick the [diagonal] "a", "a" cell as next destination since it is less than the other two (and satisfies the same or -1 condition).
      Note that the new cell is one less than current cell
      therefore the step 8 is substitute "t" with "n":     democra N

      继续一,一个细胞,
          挑[对角线]C,R单元格作为下一个目的地...
          请注意,新的小区是相同的值作为当前单元格==>的无操作需要的。

      Continue with "a", "a" cell,
      Pick the [diagonal] "c", "r" cell as next destination...
      Note that the new cell is same value as current cell ==> no operation needed.

      继续以C,R单元,
         挑[对角线]我,C型的下一个目的地...
         请注意,新的小区比当前小区
      少一个      因此,第7步是代替R和C:    democ的 C

      Continue with "c", "r" cell,
      Pick the [diagonal] "i", "c" cell as next destination...
      Note that the new cell is one less than current cell
      therefore the step 7 is substitute "r" with "c":     democ C an

      继续用我,C型,
         挑[对角线]L,O小区作为下一个目的地...
         请注意,新的小区比当前小区
      少一个      因此,第6步代替C与我:   演示的

      Continue with "i", "c" cell,
      Pick the [diagonal] "l", "o" cell as next destination...
      Note that the new cell is one less than current cell
      therefore the step 6 is substitute "c" with "i":     demo I can

      继续以L,O细胞,
         挑[对角线]B,M单元格作为下一个目的地...
         请注意,新的小区比当前小区
      少一个      因此,第5步替换O与L:   马克的 ICAN

      Continue with "l", "o" cell,
      Pick the [diagonal] "b", "m" cell as next destination...
      Note that the new cell is one less than current cell
      therefore the step 5 is substitute "o" with "l":     dem L ican

      继续B,M单元,
         挑[对角线]U,E小区作为下一个目的地...
         请注意,新的小区比当前小区
      少一个      因此,第4步是替代品M和B:   德 B lican

      Continue with "b", "m" cell,
      Pick the [diagonal]"u", "e" cell as next destination...
      Note that the new cell is one less than current cell
      therefore the step 4 is substitute "m" with "b":     de B lican

      继续U,E单元,
         注意对角线小区不符合,因为左的细胞是比它少。    挑[左]P,E小区作为下一个目的地...
              因此,在第3步是instertU,E之后:   德 U blican

      Continue with "u", "e" cell,
      Note the "diagonal" cell doesn't qualify, because the "left" cell is less than it. Pick the [left] "p", "e" cell as next destination...
      therefore the step 3 is instert "u" after "e":     de U blican

      继续以P,E单元,
         再次对角线小区不符合    挑[左]E,E小区作为下一个目的地...
              因此,第2步是instertp的后的e:   德 P ublican

      Continue with "p", "e" cell,
      again the "diagonal" cell doesn't qualify Pick the [left] "e", "e" cell as next destination...
      therefore the step 2 is instert "p" after "e":     de P ublican

      继续以E,E单元,
         挑[对角线]R,D型的下一个目的地...
         请注意,新的小区是相同的值作为当前单元格==>的无操作需要的。

      Continue with "e", "e" cell,
      Pick the [diagonal] "r", "d" cell as next destination...
      Note that the new cell is same value as current cell ==> no operation needed.

      继续以R,D型,
         挑[对角线]开始小区作为下一个目的地...
         请注意,新的小区比当前小区
      少一个      因此,第1步是替代D和R:    研究 epublican

      Continue with "r", "d" cell,
      Pick the [diagonal] "start" cell as next destination...
      Note that the new cell is one less than current cell
      therefore the step 1 is substitute "d" with "r":     R epublican

      您已经到达细胞,其值为0:你的工作就完成了

      You've arrived at a cell which value is 0 : your work is done!

      这篇关于莱文斯坦距离:从矩阵推断编辑操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆