最小编辑距离重建 [英] Minimum Edit Distance Reconstruction

查看:73
本文介绍了最小编辑距离重建的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我知道在堆栈上和在线上都可以找到类似的答案,但是我觉得我缺少一些东西.给定以下代码,我们需要重建导致最小编辑距离的事件序列.对于下面的代码,我们需要编写一个输出以下函数:

I know there are similar answer to this on stack, as well as online, but I feel I'm missing something. Given the code below, we need to reconstruct the sequence of events that led to the resulting minimum edit distance. For the code below, we need to write a function that outputs:

Equal, L, L
Delete, E
Equal, A, A
Substitute, D, S
Insert, T

代码已通过我的(部分正确的)解决方案更新

这是代码,还有我的部分解决方案.例如,它的工作方式是给我("lead"->"last"),但不适用于下面的示例("hint"->"isnt").我怀疑这是因为第一个字符相等,这使我的代码无效.正确方向上的任何提示或指示都将很棒!

Here is the code, with my partial solution. It works for example I was given ("lead" -> "last"), but doesn't work for the example below ("hint" -> "isnt"). I suspect this is because the first character is equal, which is throwing off my code. Any tips or pointers in the right direction would be great!

def printMatrix(M):
        for row in M:
                print row
        print

def med(s, t):  
        k = len(s) + 1
        l = len(t) + 1

        M = [[0 for i in range(k)] for j in range(l)]
        MTrace = [["" for i in range(k)] for j in range(l)]

        M[0][0] = 0


        for i in xrange(0, k):
                M[i][0] = i
                MTrace[i][0] = s[i-1]

        for j in xrange(0, l):
                M[0][j] = j
                MTrace[0][j] = t[j-1]

        MTrace[0][0] = "DONE"

        for i in xrange(1, k):
                for j in xrange(1, l):

                        sub = 1
                        sub_op = "sub"
                        if s[i-1] == t[j-1]:
                                # equality
                                sub = 0
                                sub_op = "eq"


                        # deletion
                        min_value = M[i-1][j] + 1
                        op = "del"
                        if min_value > M[i][j-1] + 1:
                                # insertion
                                min_value = M[i][j-1] + 1
                                op = "ins"
                        if min_value > M[i-1][j-1] + sub:
                                # substitution
                                min_value = M[i-1][j-1] + sub
                                op = sub_op


                        M[i][j] = min_value
                        MTrace[i][j] = op                        

        print "final Matrix"
        printMatrix(M)
        printMatrix(MTrace)

############ MY PARTIAL SOLUTION

        def array_append(array,x,y):
            ops_string = MTrace[x][y]
            if ops_string == 'ins':
                array.append(("Insert",MTrace[0][y]))
            elif ops_string == 'sub':
                array.append(("Substitute",MTrace[x][0],MTrace[0][y]))
            elif ops_string == 'eq':
                array.append(("Equal",MTrace[x][0],MTrace[0][y]))
            elif ops_string == 'del':
                array.append(("Delete",MTrace[x][0]))


        i = len(s)
        j = len(t)

        ops_array = []
        base = M[i][j]
        array_append(ops_array,i,j)


        while MTrace[i][j] != "DONE":
            base = M[i][j]
            local_min = min(M[i][j-1],M[i-1][j],M[i-1][j-1])
            if base == local_min:
                i = i - 1
                j = j - 1
                array_append(ops_array,i,j)
            elif M[i][j-1] < M[i-1][j]:
                j = j -1
                array_append(ops_array,i,j)
            elif M[i-1][j] < M[i][j-1]:
                i = i - 1
                array_append(ops_array,i,j)
            else:
                i = i - 1
                j = j - 1
                array_append(ops_array,i,j)

        print ops_array
#########

        return M[k-1][l-1]      

print med('lead', 'last')

推荐答案

在这种情况下,我认为更深入地了解算法很重要.除了向您提供一些伪代码外,我还将向您介绍该算法的基本步骤,并向您展示如何在所需的最终矩阵中编码"所需的数据.当然,如果您不需要使用自己的算法,那么显然您应该使用其他人的算法,如 MattH建议

It's my opinion that understanding the algorithm more deeply is important in this case. Rather than giving you some pseudocode, I'll walk you through the essential steps of the algorithm, and show you how the data you want is "encoded" in the final matrix that results. Of course, if you don't need to roll your own algorithm, then you should obviously just use someone else's, as MattH suggests!

在我看来,这就像 Wagner-Fischer算法.基本思想是计算附近"前缀之间的距离,取最小值,然后从中计算出当前字符串对的距离.例如,假设您有两个字符串'i''h'.让我们沿着矩阵的垂直轴和水平轴进行布局,如下所示:

This looks to me like an implementation of the Wagner-Fischer algorithm. The basic idea is to calculate the distances between "nearby" prefixes, take the minimum, and then calculate the distance for the current pair of strings from that. So for example, say you have two strings 'i' and 'h'. Let's lay them out along the vertical and horizontal axes of a matrix, like so:

  _ h
_ 0 1
i 1 1

在这里,'_'表示一个空字符串,矩阵中的每个单元格对应于一个将输入('''i')输入到输出('''h')的编辑序列.

Here, '_' denotes an empty string, and each cell in the matrix corresponds to an edit sequence that takes an input ('' or 'i') to an output ('' or 'h').

从空字符串到长度为L的任何字符串的距离为L(需要插入L个字符).从任何长度为L的字符串到空字符串的距离也为L(需要删除L个字符).这覆盖了第一行和第一列中的值,这些值只是递增.

The distance from the empty string to any string of length L is L, (requiring L insertions). The distance from any string of length L to the empty string is also L (requiring L deletions). That covers the values in the first row and column, which simply increment.

从那里,您可以通过从上,左和左上的值中取最小值,然后加上一个,或者如果字符串中的那个点的字母相同,来计算任何位置的值.保持左上角的值不变.对于上表中(1, 1)的值,最小值是(0, 0)0,因此(1, 1)的值是1,这是从'i''h'的最小编辑距离(一种替代).因此,通常,最小编辑距离始终位于矩阵的右下角.

From there, you can calculate the value of any location by taking the minimum from among the upper, left, and upper-left values, and adding one, or, if the letter is the same at that point in the string, taking the upper-left value unchanged. For the value at (1, 1) in the table above, the minimum is 0 at (0, 0), so the value at (1, 1) is 1, and that's the minimum edit distance from 'i' to 'h' (one substitution). So in general, the minimum edit distance is always in the lower right corner of the matrix.

现在让我们再做一次,将ishi进行比较.同样,矩阵中的每个单元格都对应一个编辑序列,该编辑序列将输入('''i''is')输入到输出('''h''hi').

Now let's do another, comparing is to hi. Here again, each cell in the matrix corresponds to an edit sequence that takes an input ('', 'i', or 'is') to an output ('', 'h', or 'hi').

  _ h i
_ 0 1 2
i 1 1 #
s 2 # #

我们首先扩大矩阵,将#用作尚不知道的值的占位符,然后通过递增来扩展第一行和第一列.这样,我们就可以开始计算上面标记为#的位置的结果.让我们从(2, 1)开始(以(行,列),即行为主的表示法).在上,左上和左值中,最小值为1.表中对应的字母不同-sh-因此我们在该最小值上加一个以获得2,然后继续.

We begin by enlarging the matrix, using # as a placeholder for values we don't know yet, and extending the first row and column by incrementing. Having done so, we can begin calculating results for positions marked # above. Let's start at (2, 1) (in (row, column), i.e. row-major notation). Among the upper, upper-left, and left values, the minimum is 1. The corresponding letters in the table are different -- s and h -- so we add one to that minimum value to get 2, and carry on.

  _ h i
_ 0 1 2
i 1 1 #
s 2 2 #

让我们继续在(1, 2)处的值.现在情况有所不同,因为表中的相应字母相同-它们都是i.这意味着我们可以选择在左上角单元格中添加值而无需添加一个.这里的指导直觉是我们不必增加计数,因为在此位置将相同的字母添加到两个字符串中.而且由于两根弦的长度都增加了一个,所以我们沿对角线移动.

Let's move on to the value at (1, 2). Now things go a little differently because the corresponding letters in the table are the same -- they're both i. This means we have the option of taking the value in the upper-left cell without adding one. The guiding intuition here is that we don't have to increase the count because the same letter is being added to both strings at this position. And since the lengths of both strings have increased by one, we move diagonally.

  _ h i
_ 0 1 2
i 1 1 1
s 2 2 #

最后一个空单元格使情况恢复正常.对应的字母是si,因此我们再次取最小值并加一个,得到2:

With the last empty cell, things go back to normal. The corresponding letters are s and i, and so we again take the minimum value and add one, to get 2:

  _ h i
_ 0 1 2
i 1 1 1
s 2 2 2

如果继续对两个较长的单词(以ishi开头的单词-isnt(忽略标点符号)和hint:

Here's the table we get if we continue this process for two longer words that start with is and hi -- isnt (ignoring punctuation) and hint:

  _ h i n t
_ 0 1 2 3 4
i 1 1 1 2 3
s 2 2 2 2 3
n 3 3 3 2 3
t 4 4 4 3 2

此矩阵稍微复杂一点,但是这里的最终最小编辑距离仍然仅为2,因为这两个字符串的最后两个字母相同.方便的!

This matrix is slightly more complex, but the final minimum edit distance here is still just 2, because the last two letters of these two strings are the same. Convenient!

那么我们如何从该表中提取编辑类型?关键是要意识到桌子上的移动对应于特定的编辑类型.因此,例如,从(0, 0)(0, 1)的向右移动将我们从不需要编辑的_ -> _带到需要编辑的插入的_ -> h.同样,从(0, 0)(1, 0)的向下移动将我们从不需要编辑的_ -> _带到需要编辑的一个i -> _,即删除.最后,从(0, 0)(1, 1)的对角线运动将我们从不需要编辑的_ -> _带到了需要进行一次编辑和替换的i -> h.

So how can we extract the types of edits from this table? The key is to realize that movement on the table corresponds to particular types of edits. So for example, a rightward movement from (0, 0) to (0, 1) takes us from _ -> _, requiring no edits, to _ -> h, requiring one edit, an insertion. Likewise, a downward movement from (0, 0) to (1, 0) takes us from _ -> _, requiring no edits, to i -> _, requiring one edit, a deletion. And finally, a diagonal movement from (0, 0) to (1, 1) takes us from _ -> _, requiring no edits, to i -> h, requiring one edit, a substitution.

因此,现在我们要做的就是反转步骤,从上,左和左上单元格中追踪局部最小值回到原点(0, 0),请记住,如果当前值相同作为最低要求,那么我们必须转到左上角的单元格,因为这是唯一一种不会增加编辑距离的移动.

So now all we have to do is reverse our steps, tracing local minima from among the upper, left, and upper-left cells back to the origin, (0, 0), keeping in mind that if the current value is the same as the minimum, then we must go to the upper-left cell, since that's the only kind of movement that doesn't increment the edit distance.

这里是您可以采取的步骤的详细说明.从完成的矩阵的右下角开始,重复以下操作,直到到达左上角:

Here is a detailed description of the steps you could take to do so. Starting from the lower-right corner of the completed matrix, repeat the following until you reach the upper-left corner:

  1. 查看左上方的相邻单元格.如果不存在,请转到步骤3.如果该单元确实存在,请记下存储在其中的值.
  2. 左上角单元格中的值是否等于当前单元格中的值?如果是这样,请执行以下操作:
    • 记录一个空操作(即Equal).在这种情况下,无需编辑,因为此位置的字符相同.
    • 更新当前单元格,向上和向左移动.
    • 返回到步骤1.
  1. Look at the neighboring cell to the upper left. If it doesn't exist, go to step 3. If the cell does exist, note the value stored there.
  2. Is the value in the upper-left cell equal to the value in the current cell? If so, then do the following:
    • Record an empty operation (i.e. Equal). No edit was required in this case because the characters at this location are the same.
    • Update the current cell, moving up and left.
    • Return to step 1.
  • 如果左侧没有单元格,而上方没有单元格,则您位于左上角,并且算法已完成.
  • 如果左侧没有单元格,请转到第4步.(这将继续循环直到您到达左上角.)
  • 如果上方没有单元格,请转到第5步.(此过程将一直循环直到您到达左上角.)
  • 否则,请在左侧的单元格,左上方的单元格和上方的单元格之间进行三向比较.选择一个最小的值.如果有多个候选人,则可以随机选择一个.在此阶段,它们全部有效. (它们对应于具有相同总编辑距离的不同编辑路径.)
    • 如果您选择了上面的单元格,请转到步骤4.
    • 如果您选择左侧的单元格,请转到步骤5.
    • 如果您在左上方选择了单元格,请转到步骤6.
    • If there is no cell to the left and no cell above, then you are in the upper-left corner, and the algorithm has completed.
    • If there is no cell to the left, go to step 4. (This will continue in a loop until you reach the upper-left corner.)
    • If there is no cell above, go to step 5. (This will continue in a loop until you reach the upper-left corner.)
    • Otherwise, do a three-way comparison between the cell to the left, the cell to the upper left, and the cell above. Pick the one with the smallest value. If there are multiple candidates, you can pick one at random; they are all valid at this stage. (They correspond to different edit paths with the same total edit distance.)
      • If you picked the cell above, go to step 4.
      • If you picked the cell to the left, go to step 5.
      • If you picked the cell to the upper left, go to step 6.
      • 记录当前单元格上输入字符的删除.
      • 更新当前单元格,向上移动.
      • 返回到步骤1.
      • 记录输出字符在当前单元格中的插入.
      • 更新当前单元格,向左移动.
      • 返回到步骤1.
      • 记录当前单元格上输入字符的替换内容,代替当前单元格上的输出字符.
      • 更新当前单元格,向上和向左移动.
      • 返回到步骤1.

      放在一起

      在上面的示例中,有两种可能的路径:

      Putting it Together

      In the example above, there are two possible paths:

      (4, 4) -> (3, 3) -> (2, 2) -> (1, 2) -> (0, 1) -> (0, 0)
      

      (4, 4) -> (3, 3) -> (2, 2) -> (1, 1) -> (0, 0)
      

      扭转它们,我们得到

      (0, 0) -> (0, 1) -> (1, 2) -> (2, 2) -> (3, 3) -> (4, 4)
      

      (0, 0) -> (1, 1) -> (2, 2) -> (3, 3) -> (4, 4)
      

      因此对于第一个版本,我们的第一个操作是向右移动,即插入.插入的字母是h,因为我们是从isnt移到hint. (这对应于详细输出中的Insert, h.)我们的下一个操作是对角线移动,即替换或无操作.在这种情况下,这是无操作操作,因为两个位置的编辑距离都相同(即字母相同).所以Equal, i, i.然后向下移动,对应于删除.删除的字母是s,因为我们又从isnt移到了hint. (通常,要插入的字母来自输出字符串,而要删除的字母来自输入字符串.)因此是Delete, s.然后,两个对角线运动值保持不变:Equal, n, nEqual, t, t.

      So for the first version, our first operation is a movement to the right, i.e. an insertion. The letter inserted is h, since we're moving from isnt to hint. (This corresponds to Insert, h in your verbose output.) Our next operation is a diagonal movement, i.e. either a substitution, or a no-op. In this case, it's a no-op because the edit distance is the same at both locations (i.e. the letter is the same). So Equal, i, i. Then a downward movement, corresponding to a deletion. The letter deleted is s, since again, we're moving from isnt to hint. (In general, the letter to insert comes from the output string, while the letter to delete comes from the input string.) So that's Delete, s. Then two diagonal movements with no change in value: Equal, n, n and Equal, t, t.

      结果:

      Insert, h
      Equal, i, i
      Delete, s
      Equal, n, n
      Equal, t, t
      

      isnt上执行以下说明:

      isnt   (No change)
      hisnt  (Insertion)
      hisnt  (No change)
      hint   (Deletion)
      hint   (No change)
      hint   (No change)
      

      总编辑距离为2.

      我将保留第二条最小路径作为练习.请记住,两条路径是完全等效的.它们可能有所不同,但是它们将导致相同的最小编辑距离2,因此可以完全互换.在遍历矩阵的任何时候,如果您看到两个可能的局部最小值,则可以取其中一个,并且保证最终结果是正确的

      I'll leave the second minimum path as an exercise. Keep in mind that both paths are completely equivalent; they may be different, but they will result in the same minimum edit distance of 2, and so are entirely interchangeable. At any point as you work backwards through the matrix, if you see two different possible local minima, you may take either one, and the final result is guaranteed to be correct

      一旦您掌握了所有这些内容,就不难编写代码了.在这种情况下,关键是首先深刻理解算法.完成此操作后,对其进行编码就很容易了.

      Once you grok all this, it shouldn't be hard to code at all. The key, in cases like this, is to deeply understand the algorithm first. Once you've done that, coding it up is a cinch.

      最后一点,您可以选择在填充矩阵时累积编辑.在这种情况下,矩阵中的每个单元格都可以是一个元组:(2, ('ins', 'eq', 'del', 'eq', 'eq')).您将增加长度,并追加与从最小先前状态开始的移动相对应的操作.这样就消除了回溯,从而降低了代码的复杂性;但它会占用额外的内存.如果这样做,最终的编辑序列将与最终的编辑距离一起出现在矩阵的右下角.

      As a final note, you might chose to accumulate the edits as you populate the matrix. In that case, each cell in your matrix could be a tuple: (2, ('ins', 'eq', 'del', 'eq', 'eq')). You would increment the length, and append the operation corresponding to a movement from the minimal previous state. That does away with the backtracking, and so decreases the complexity of the code; but it takes up extra memory. If you do this, the final edit sequence will appear along with the final edit distance in the lower right corner of the matrix.

      这篇关于最小编辑距离重建的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆