尤金·迈尔斯(Eugene Myers)的Diff算法:找出最长的“ A”公共子序列。和“ B”表示 [英] Eugene Myers' Diff Algorithm: Finding the Longest Common Subsequence of "A" and "B"

查看:236
本文介绍了尤金·迈尔斯(Eugene Myers)的Diff算法:找出最长的“ A”公共子序列。和“ B”表示的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在审查 Eugene Myers的Diff算法论文。这是在流行的 diff 程序中实现的算法。

I've been reviewing Eugene Myers' Diff Algorithm Paper. This is the algorithm that is implemented in the popular diff program.

在论文的第12页上,介绍了该算法的伪代码以找到最长的公共子序列 A B

On page 12 of the paper, it presents the pseudo-code for the algorithm to find the longest common sub-sequence of A and B:

LCS(A, N, B, M)
    If N > 0 and M > 0 Then
        Find the middle snake and the length of an optimal path for A and B.
        Suppose it is from (x, y) to (u, v).
        If D > 1 Then
            LCS(A[1..x], x, B[1..y], y)
            Output A[x+1..u]
            LCS(A[u+1..N], N-u, B[v+1..M], M-v)
        Else If M > N Then
            Output A[1..N].
        Else
            Output B[1..M].

假设A = A和B = B。在这种情况下,N = 1且M =1。由于没有对角线,因此中间的蛇将为(x,y)=(0,1)和(u,v)=(0,1)。在这种情况下,D = 1,因为该算法仅采取了一步。

Suppose A = "A" and B = "B". In this case, N = 1 and M = 1. The middle snake would be (x, y) = (0, 1) and (u, v) = (0, 1) because there are no diagonals. In this case D = 1 because the algorithm has only taken one step.

算法表明,在这种情况下,唯一要做的就是输出B [1..M] 等于 B,因为N> 0,M> 0,D = 1且M =N。但这似乎是错误的,因为在 A和 B之间没有共同的子序列。该论文的评论如果D <== 1则通过删除或插入最多一个符号从A获得B是不正确的,因为必须删除 A并添加 B。

The algorithm says that the only thing to do in this scenario is to Output B[1..M], equal to "B", because N > 0, M > 0, D = 1, and M = N. But this seems wrong, because there is no common sub-sequence between "A" and "B". The paper's commentary that "If D <= 1 then B is obtained from A by either deleting or inserting at most one symbol" is incorrect because "A" must be removed and "B" added.

我在这里误解了什么?

推荐答案

您误解了D路径和蛇的定义。

You are misunderstanding the definition of D-path and snake.

从第4页开始:


让D路径作为开始的路径在(0,0)具有完全D个非对角的
边。 0路径必须仅由对角线边缘组成。通过简单的归纳法得出,D路径必须由(D − 1)路径,后跟非对角边,然后可能为空的对角边序列(称为蛇)组成

因此,在您的示例中,A = A和B = B,最佳路径是2路径(一个水平和垂直),中间的蛇是一条空弦。通过检查我们知道LCS是一个空字符串,但是我们想展示该算法可以证明这一点。

So, in your example where A = "A" and B = "B", the optimal path is a 2-path (one horizontal and one vertical), and the middle snake is an empty string. We know from inspection the LCS is an empty string, but we'd like to show the algorithm working to prove it.

首先,我们需要找到中间的蛇。如果按照第11页的算法查找中间的蛇,则会看到最短的编辑脚本的长度为2并且(x,y)=(u,v)=(1,0)或(0,1 )。换句话说,它是路径中间的一条空蛇。

First of all we need to find the middle snake. If you follow the algorithm to find the middle snake on page 11, you will see that the length of the shortest edit script is 2 and (x,y) = (u,v) = (1,0) or (0,1). In other words, it is an empty snake in the middle of the path.

算法伪代码具有一些非显而易见的符号约定:

The algorithm pseudocode has some non-obvious notational conventions:


  1. A [m..n]是一个空字符串,如果n < m。

  2. 在对LCS的递归调用中,第一个调用使用ceiling(D / 2)作为D,第二个调用使用floor(D / 2)。这在第12页的算法上方的文本中更加清楚了。

因此,在本示例情况下,假定首次调用LCS找到一条(x,y)=(u,v)=(1,0)的中间蛇,然后由于D = 2,结果是展开:

So, in this example case, assume the first call to LCS finds a middle snake with (x,y) = (u,v) = (1,0), then since D=2, the result is the expansion of:

LCS(A[1..1], 1, B[1..0], 0)  // Output nothing since M = 0
Output A[2..1]               // Output nothing since it is an empty string.
LCS(A[2..1], 0, B[1..1], 1)  // Output nothing since N = 0

这是正确的,因为这些字符串没有共同的子序列。

This is correct since these strings have no common sub-sequence.

这篇关于尤金·迈尔斯(Eugene Myers)的Diff算法:找出最长的“ A”公共子序列。和“ B”表示的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆