Is git's implementation of the patience diff algorithm correct?


Problem Description

This question on Stack Overflow seemed to be a good candidate for the application of the patience diff algorithm. However, while testing my potential answer, I found that git diff --patience doesn't live up to my expectations (and, in this case, is no different from the default diff algorithm):

$ cat a
/**
 * Function foo description.
 */
function foo() {}

/**
 * Function bar description.
 */
function bar() {}

$ cat b
/**
 * Function bar description.
 */
function bar() {}

$ git diff --no-index --patience a b
diff --git a/a b/b
index 3064e15..a93bad0 100644
--- a/a
+++ b/b
@@ -1,9 +1,4 @@
 /**
- * Function foo description.
- */
-function foo() {}
-
-/**
  * Function bar description.
  */
 function bar() {}

I would expect the diff to be:

diff --git a/a b/b
index 3064e15..a93bad0 100644
--- a/a
+++ b/b
@@ -1,8 +1,3 @@
-/**
- * Function foo description.
- */
-function foo() {}
-
 /**
  * Function bar description.
  */

In my understanding, unique common lines in this case are the two lines mentioning bar, and the longest common context around those lines should be the function bar() along with its docblock, which means that the diff should boil down to the removed function foo() together with its own docblock and the following blank line.

Solution

No one else has addressed this for a while, so I'll take a stab at it. This is all purely high-level theoretical though, since I have not read the paper on the original patience algorithm.

The LCS (longest common subsequence) algorithms are all about reducing the time spent finding a minimal edit distance solution. The standard (dynamic programming) solution is O(MN) where M is the number of symbols in the original string and N is the number of symbols in the target string. In our case, the "symbols" are lines, and the "string" is the collection of lines, rather than strings with characters (where the symbols would be, e.g., ASCII codes). We simply fill in an M x N matrix of "edit costs"; when we're done, we produce the actual edit by tracing a minimal path backwards through the resulting matrix. See https://jlordiales.me/2014/03/01/dynamic-programming-edit-distance/ for an example. (Web page found via Google search: it's not something I had anything to do with, other than to scan it at high speed for correctness now. It seems correct. :-) )
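
For concreteness, here is a minimal Python sketch of that matrix fill, with illustrative unit costs (delete = insert = 1, replace = 2, keep = 0); it is a sketch of the textbook algorithm, not git's code:

def edit_cost_matrix(a_lines, b_lines):
    # cost[i][j] = cheapest edit turning a_lines[:i] into b_lines[:j]
    m, n = len(a_lines), len(b_lines)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i                      # delete the first i lines
    for j in range(1, n + 1):
        cost[0][j] = j                      # insert the first j lines
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            keep = 0 if a_lines[i - 1] == b_lines[j - 1] else 2
            cost[i][j] = min(cost[i - 1][j] + 1,         # delete a line
                             cost[i][j - 1] + 1,         # insert a line
                             cost[i - 1][j - 1] + keep)  # keep or replace
    return cost  # trace a minimal path back from cost[m][n] for the edits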

Actually computing this matrix is fairly expensive for big files, since M and N are the number of source lines (usually approximately equal): a ~4k line file results in ~16M entries in the matrix, which must be filled in completely before we can trace the minimal path back. Moreover, comparing "symbols" is no longer as trivial as comparing characters, since each "symbol" is a complete line. (The usual trick is to hash each line and compare hashes instead during the matrix generation, and then re-check during traceback, replacing "keep unchanged symbol" with "delete original and insert new" if the hash misled us. This works fine even in the presence of hash collisions: we may get a very slightly sub-optimal edit sequence, but it will practically never be awful.)
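
A sketch of that hashing trick, using zlib.crc32 purely as a stand-in for whatever hash a real implementation would use:

import zlib

def line_hashes(lines):
    # One cheap integer per line; the matrix fill compares these instead
    # of whole lines. Collisions are possible but rare.
    return [zlib.crc32(line.encode("utf-8")) for line in lines]

def confirm_keep(a_lines, b_lines, i, j):
    # Run only during traceback: if the hashes matched but the real lines
    # differ, downgrade the "keep" step to a delete plus an insert.
    return a_lines[i] == b_lines[j]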

LCS modifies the matrix-calculation by observing that keeping long common subsequences ("preserve all these lines") almost always results in a big win. Having found some good LCS-es, we break the problem into "edit the non-common prefix, keep the common sequence, and edit the non-common suffix": now we calculate two dynamic programming matrices, but for smaller problems, so it goes faster. (And, of course, we can recurse on the prefix and suffix. If we had a ~4k-line file and we found ~2k totally-unchanged, in-common lines near the middle, leaving ~0.5k lines at the top and ~1.5k at the bottom, we can check for long common sub-sequences in the ~0.5k "top has a difference" lines, and then again in the ~1.5k "bottom has a difference" lines.)
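
The divide step, sketched below; dp_diff is a hypothetical stand-in for the full dynamic-programming diff described above:

def diff_with_anchor(a, b, a_lo, a_hi, b_lo, b_hi):
    # a[a_lo:a_hi] == b[b_lo:b_hi] is the common run we chose to keep;
    # dp_diff is a hypothetical full dynamic-programming diff as above.
    prefix = dp_diff(a[:a_lo], b[:b_lo])    # smaller matrix: the tops
    suffix = dp_diff(a[a_hi:], b[b_hi:])    # smaller matrix: the bottoms
    return prefix + [("keep", line) for line in a[a_lo:a_hi]] + suffix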

LCS does poorly, and thus results in terrible diffs, when the "common subsequences" are trivial lines like a lone closing brace }, which have lots of matches but are not really relevant. The patience diff variant simply discards these lines from the initial LCS calculation, so that they're not part of a "common subsequence". This makes the remaining matrices larger, which is why you must be patient. :-)
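
The core of the patience variant can be sketched as follows: keep only lines that occur exactly once in each file, then take the longest increasing subsequence of their positions as the common skeleton. This is a minimal sketch, not git's actual code:

from bisect import bisect_left
from collections import Counter

def patience_anchors(a_lines, b_lines):
    # Keep only lines occurring exactly once in each file; lines like "}"
    # appear many times and are discarded here.
    ca, cb = Counter(a_lines), Counter(b_lines)
    pos_b = {ln: j for j, ln in enumerate(b_lines)
             if ca[ln] == 1 and cb[ln] == 1}
    pairs = [(i, pos_b[ln]) for i, ln in enumerate(a_lines) if ln in pos_b]
    # Longest increasing subsequence of the b-positions: the unique lines
    # that appear in the same relative order in both files.
    tails, tails_at, prev = [], [], [-1] * len(pairs)
    for k, (_, bj) in enumerate(pairs):
        p = bisect_left(tails, bj)
        if p == len(tails):
            tails.append(bj); tails_at.append(k)
        else:
            tails[p] = bj; tails_at[p] = k
        prev[k] = tails_at[p - 1] if p else -1
    anchors, k = [], tails_at[-1] if tails_at else -1
    while k != -1:
        anchors.append(pairs[k]); k = prev[k]
    return anchors[::-1]  # (line index in a, line index in b) pairs

On the a and b files from the question, this yields exactly the two lines mentioning bar, matching the questioner's own analysis of which lines are unique and common.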

The result is that patience diff is no help here because our problem has nothing to do with common subsequences. In fact, even if we discarded LCSes entirely and just did one big matrix, we'd still get an undesirable result. Our problem is that the cost of deleting:

- * Function foo description.
- */
-function foo() {}
-
-/**

(and inserting nothing) is the same as the cost of deleting:

-/**
- * Function foo description.
- */
-function foo() {}
-

The cost of either one is just "delete 5 symbols". Even if we weight each symbol—make non-empty lines "more expensive" to delete than empty lines—the cost remains the same: we're deleting the same five lines, in the end.
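
A quick way to see the tie: the two candidate deletions are the same multiset of five lines, so any per-line weighting sums to the same total (the weights here are made up):

option_1 = [" * Function foo description.", " */", "function foo() {}",
            "", "/**"]                        # what git chose
option_2 = ["/**", " * Function foo description.", " */",
            "function foo() {}", ""]          # the desired hunk

weight = lambda line: 1.0 if line.strip() else 0.25   # made-up weights
assert sorted(option_1) == sorted(option_2)           # same five lines
assert sum(map(weight, option_1)) == sum(map(weight, option_2))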

What we need, instead, is some way to weight lines based on "visual clustering": short lines at an edge are cheaper to delete than short lines in the middle. The compaction heuristic added to Git 2.9 attempts to do this after the fact. It's obviously at least slightly flawed (only blank lines count, and they have to actually exist, not just be implied by reaching an edge). It might be better to do the weighting during matrix-filling (assuming what's left, after doing LCS elimination, really is going through the full dynamic programming matrix). This is nontrivial, though.
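
Purely as an illustration of what such a weighting might look like (this is a sketch, not git's actual compaction heuristic): score a candidate hunk by whether its boundaries land on a blank line or a file edge, and prefer the lower score.

def hunk_badness(lines, start, end):
    # lines[start:end] is a candidate deletion. A boundary is "good" when
    # it sits at the file edge or next to a blank line, so the deleted
    # hunk stays a visually complete cluster.
    before_ok = start == 0 or not lines[start - 1].strip()
    after_ok = end == len(lines) or not lines[end].strip()
    return (0 if before_ok else 1) + (0 if after_ok else 1)

On file a above, git's hunk (lines[1:6]) scores 2, while the desired hunk (lines[0:5]) scores 1 because its top boundary is the file edge, so the desired output would win.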
