`git diff --patience`和`git diff --histogram`有什么区别? [英] What's the difference between `git diff --patience` and `git diff --histogram`?

查看:241
本文介绍了`git diff --patience`和`git diff --histogram`有什么区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此早期问题问了4种不同的Git diff策略之间的差异,但唯一的区别是 myers 耐心,这在其他地方中有很好的解释。



直方图策略如何工作?与耐心区别的是什么? git-diff手册页只是说它将耐心算法扩展到支持低发生的共同因素。其他网页提到它更快,它来自JGit,但它们没有解释它的算法或结果在何处或如何与耐心



在哪里可以找到直方图算法相对于<$ c $的描述c>耐心算法,具有与 Bram Cohen's原始描述耐心算法

(如果只是执行性能问题,那么会产生不同的结果,为什么它不是作为一个新的后端实现耐心?)

解决方案

这个直方图策略是在 git 1.7.7(2011年9月),随后g描述(正如OP所述)


git diff learn a - 直方图选项使用从 jgit ,这可能会带来更好的性能。


JGit包含 src / org / eclipse / jgit / diff /HistogramDiff.java tst / org / eclipse / jgit / diff / HistogramDiffTest.java



这里的描述相当完整: b
$ b


HistogramDiff



Bram Cohe的扩展形式n的耐心差异算法。



这个实现是通过使用
Bram Cohen的博客,然后进一步扩展为支持低出现率的常见元素。 基本该算法的想法是为为序列A的每个元素创建出现的直方图。然后依次考虑序列B的每个元素。如果元素也存在于序列A中,并且发生次数较低,则这些位置被视为最长公共子序列(LCS)的候选者。

在B的扫描完成后,具有选择出现次数最少的点作为分割点。该区域在LCS周围分裂,并且该算法递归地应用于LCS之前和之后的部分。


通过总是选择最低的LCS位置如果在两个序列之间存在唯一的公共元素,则此算法的行为与Bram Cohen的耐心差异完全相同。

当不存在唯一元素时,选择最低的出现元素


这比标准Myers的 O(ND)算法产生的结果提供了更多可读的差异。



为了防止算法具有 O(N ^ 2)运行时间,直方图桶中唯一元素数量的上限是由 #setMaxChainLength(int)的

如果序列A具有超过这个散列到同一个散列桶的元素的数量,算法会将该区域传递给 #setFallbackAlgorithm(DiffAlgorithm)

如果没有配置回退算法,则该区域将作为替换编辑发出。



在扫描序列B期间,任何元素发生超过 #setMaxChainLength(int) 次不会被考虑用于LCS匹配位置,即使它在两个序列之间是共同的。这限制了序列A中必须考虑查找LCS的位置数量,并有助于保持较低的运行时间限制。



只要 #setMaxChainLength(int) 是一个小常数(如64),算法运行在 O(N * D)时间,其中<​​code> N 是输入长度的总和, D 是生成的 EditList

如果提供的 SequenceComparator 具有很好的散列函数, -performs MyersDiff ,尽管它的理论运行时间是相同的。



这个实现有一个内部限制,可以防止它处理多于268,435,456(2 ^ 28)个元素的序列







请注意,这种算法已经用于pack_check,早在2006年(git 1.3),用于 git-verify-pack -v 。这是重用于git 1.7.7中的索引包






Commit 8c912ee a>实际上引入了 - 直方图与diff:


Port JGit的HistogramDiff算法结束到C.粗体数字(TODO)显示
,它比它的 - 耐心表亲,以及默认的Meyers算法更快。



该实现已被重新编译为使用结构和指针,
而不是位掩码,因此废除了JGit的 2 ^ 28

我们还使用 xdiff 的默认哈希表实现( code> xdl_hash_bits()
with XDL_HASHLONG())。
blockquote>

commit 8555123(git 1.7.10,April 2012)补充:


8c912ee(示教 - 直方图至 diff ,2011-07-12 )声称直方图diff
比Myers和耐心更快。



我们已经加入了一个性能测试框架,因此添加一个
测试来比较在真正的 log -p '
工作负载中执行各种diff任务。

这确实表明直方图diff略微超过了Myers,而耐心比其他人慢得多。


最后, commit 07ab4de(git 1.8.2,2013年3月) add


$ b config:引入diff.algorithm变量


某些用户或项目更喜欢使用不同于其他算法的算法例如,对myers或类似的耐心。

但是,每次使用差异时指定适当的参数是不切实际的。而且,创建别名与基于diff的其他工具(例如 git-show )并不能很好地协作。

因此,需要一个能够设置特定算法的配置变量。

现在,接受这四个值:


$ b

  • ' myers '(这与完全不设置配置变量具有相同的效果),

  • '最少',

  • '耐心'和

  • '直方图'。


提交07924d4 同时添加 - diff-algorithm 命令行选项。

作为OP Stuart P. Bentley 提及在评论中:


您可以将Git配置为默认使用直方图



  git config --global diff.algorithm直方图






更新:Git 2.12(2017年第1季度)将退出快速散列,在某些特殊情况下会出现灾难性的性能问题。 提交1f7c926 (2016年12月1日)作者: Jeff King( peff
(由 Junio C Hamano - gitster 合并) - 提交731490b ,2016年12月19日) < / p>


xdiff :drop XDL_FAST_HASH



xdiff 代码散列diff两侧的每一行,然后比较这些散列找到重复的。整体性能取决于我们可以计算散列的速度,也取决于我们看到多少散列冲突。

XDL_FAST_HASH 是加快哈希计算。

但生成的哈希具有更差的碰撞行为。这意味着,在某些情况下,它会加速差异(在 git.git 上运行 git log -p >) 〜8%),但在其他情况下,它会减慢速度。 一个病态案例超过100倍放缓



可能有一个更好的散列函数可以涵盖两个属性,但同时我们最好使用原始散列。在一般情况下,它会稍微慢一些,但它的病例较少。



This earlier question asked for the differences between 4 different Git diff strategies, but the only difference that was explained was the difference between myers and patience, which is pretty well explained elsewhere.

How does the histogram strategy work? What differentiates it from patience? The git-diff man page only says that it "extends the patience algorithm to "support low-occurrence common elements"." Other pages mention that it's faster, and that it comes from JGit, but they don't explain where or how its algorithm or results will differ from patience.

Where can I find a description of the histogram algorithm relative to the patience algorithm, with the same level of detail as Bram Cohen's original description of the patience algorithm?

(If it's just a matter of implementation performance with no case that will produce different results, why wasn't it just implemented as a new backend for patience?)

解决方案

This histogram strategy was introduced in git 1.7.7 (Sept 2011), with the following description (as mentioned by the OP)

"git diff" learned a "--histogram" option to use a different diff generation machinery stolen from jgit, which might give better performance.

JGit includes src/org/eclipse/jgit/diff/HistogramDiff.java and tst/org/eclipse/jgit/diff/HistogramDiffTest.java

The description there is fairly complete:

HistogramDiff

An extended form of Bram Cohen's patience diff algorithm.

This implementation was derived by using the 4 rules that are outlined in Bram Cohen's blog, and then was further extended to support low-occurrence common elements.

The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS).
After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.

By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen's patience diff whenever there is a unique common element available between the two sequences.
When no unique elements exist, the lowest occurrence element is chosen instead
.
This offers more readable diffs than simply falling back on the standard Myers' O(ND) algorithm would produce.

To prevent the algorithm from having an O(N^2) running time, an upper limit on the number of unique elements in a histogram bucket is configured by #setMaxChainLength(int).
If sequence A has more than this many elements that hash into the same hash bucket, the algorithm passes the region to #setFallbackAlgorithm(DiffAlgorithm).
If no fallback algorithm is configured, the region is emitted as a replace edit.

During scanning of sequence B, any element of A that occurs more than #setMaxChainLength(int) times is never considered for an LCS match position, even if it is common between the two sequences. This limits the number of locations in sequence A that must be considered to find the LCS,and helps maintain a lower running time bound.

So long as #setMaxChainLength(int) is a small constant (such as 64), the algorithm runs in O(N * D) time, where N is the sum of the input lengths and D is the number of edits in the resulting EditList.
If the supplied SequenceComparator has a good hash function, this implementation typically out-performs MyersDiff, even though its theoretical running time is the same.

This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements


Note that this kind of algo was already used for pack_check, back in 2006 (git 1.3), for git-verify-pack -v. It was reused for index-pack in git 1.7.7


Commit 8c912ee actually introduced --histogram to diff:

Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its --patience cousin, as well as the default Meyers algorithm.

The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's 2^28 line limit.

We also use xdiff's default hash table implementation (xdl_hash_bits() with XDL_HASHLONG()) for convenience.

commit 8555123 (git 1.7.10, April 2012) added:

8c912ee (teach --histogram to diff, 2011-07-12) claimed histogram diff was faster than both Myers and patience.

We have since incorporated a performance testing framework, so add a test that compares the various diff tasks performed in a real 'log -p' workload.
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.

Finally, commit 07ab4de (git 1.8.2, March 2013) add

config: Introduce diff.algorithm variable

Some users or projects prefer different algorithms over others, e.g. patience over myers or similar.
However, specifying appropriate argument every time diff is to be used is impractical. Moreover, creating an alias doesn't play nicely with other tools based on diff (git-show for instance).

Hence, a configuration variable which is able to set specific algorithm is needed.
For now, these four values are accepted:

  • 'myers' (which has the same effect as not setting the config variable at all),
  • 'minimal',
  • 'patience' and
  • 'histogram'.

Commit 07924d4 added concurrently the --diff-algorithm command line option.
As the OP Stuart P. Bentley mentions in the comments:

you can configure Git to use histogram by default with:

git config --global diff.algorithm histogram


Update: Git 2.12 (Q1 2017) will retire the "fast hash" that had disastrous performance issues in some corner cases.

See commit 1f7c926 (01 Dec 2016) by Jeff King (peff). (Merged by Junio C Hamano -- gitster -- in commit 731490b, 19 Dec 2016)

xdiff: drop XDL_FAST_HASH

The xdiff code hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes, but also on how many hash collisions we see.

The idea of XDL_FAST_HASH is to speed up the hash computation.
But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running "git log -p" on git.git improves by ~8% with it), but in others it can slow things down. One pathological case saw over a 100x slowdown.

There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It's slightly slower in the common case, but it has fewer surprising pathological cases.

这篇关于`git diff --patience`和`git diff --histogram`有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆