Trying to understand `git diff` and `git mv` rename detection mechanism


Question

This is a followup to another question I asked before.

Before being edited, the initially created file something gets renamed to somethingelse which can be observed here:

git mv something somethingelse

The file somethingelse then gets renamed back to something before the second vim edit:

git mv somethingelse something

Basically in the following portion of the code:

# If you add something to the first line, the rename will not be detected by Git
# However, if you instead create 2 newlines and fill line 3 with new code,
# the rename gets detected for whatever reason
printf "\nCOMMAND: vim something\n\n"
vim something

If at this point I add abc to the code, we would end up with:

First line of code. abc

I think this adds 4 bytes to line 1, and it results in this:

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   something
        deleted:    somethingelse

Then, if we instead add a newline and type abc on the third line (which should also be 4 bytes; correct me if I'm wrong):

First line of code.

abc

Suddenly, Git will detect the rename (edit inclusive):

On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        renamed:    somethingelse -> something

One good answer/comment by @torek given here explains this to a certain extent, taking the git diff rename detection threshold of git status into consideration.

Shouldn't Git behave identically, since we added 4 bytes in both cases, just in a different manner? Or does the newline have something to do with this?

Solution

Git's "similarity index" computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.

To compute the similarity index for two files S (source) and D (destination), Git:

  • reads both files
  • computes a hash table of all of the chunks of file S
  • computes a second hash table of all of the chunks of file D

Each entry in these two hash tables is simply a count of how many times that hash value occurs (plus, as noted below, the length of the chunk).

The hash value for a file chunk is computed by:

  • start at the current file offset (initially zero)
  • read 64 bytes or until '\n' character, whichever occurs first
  • if the file is claimed to be text and there is a '\r' before the '\n', discard the '\r'
  • hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file
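
The chunking steps above can be sketched in a few lines of Python. This is only an illustration of the description, not Git's code: the names split_chunks and chunk_table are made up here, the chunk bytes themselves stand in for Git's hash value, and flushing a trailing chunk that lacks a final newline is my simplification.

```python
from collections import Counter

CHUNK_LIMIT = 64  # maximum chunk length in bytes, per the description above


def split_chunks(data: bytes, is_text: bool = True) -> list:
    """Split data into chunks that end at '\\n' or after 64 bytes."""
    chunks = []
    current = bytearray()
    for i, byte in enumerate(data):
        # Drop a '\r' that directly precedes a '\n' when the file is text.
        if is_text and byte == 0x0D and data[i + 1:i + 2] == b"\n":
            continue
        current.append(byte)
        if len(current) == CHUNK_LIMIT or byte == 0x0A:
            chunks.append(bytes(current))
            current.clear()
    if current:
        # Trailing bytes with no final newline (simplification in this sketch).
        chunks.append(bytes(current))
    return chunks


def chunk_table(data: bytes, is_text: bool = True) -> Counter:
    """Count occurrences per chunk; the chunk stands in for Git's hash."""
    return Counter(split_chunks(data, is_text))
```

Note that the newline stays part of its chunk, so an empty line is a one-byte chunk of its own and "abc" followed by a newline is a four-byte chunk.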

Now that there are hash tables for both S and D, each possible hash h_i appears n_S times in S and n_D times in D (either may be zero, though the code skips right over hash values where both are zero). If the number of occurrences in D is less than or equal to the number of occurrences in S (i.e., n_D ≤ n_S), then D "copies from S" n_D times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a "literal add" of n_D - n_S occurrences of the hashed chunk, and D also copies all n_S original occurrences.

Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of "chunks" to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)
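
As a rough sketch of that bookkeeping, the function below turns two lists of chunks into the copied and added byte counts. The name count_changes and the list-of-chunks input are assumptions made for illustration (Git works on hash tables, not on the chunks themselves), and the example assumes the original file held only the line "First line of code." from the question.

```python
from collections import Counter


def count_changes(src_chunks, dst_chunks):
    """Return (src_copied, literal_added) in bytes for two chunk lists."""
    src_table = Counter(src_chunks)
    dst_table = Counter(dst_chunks)
    src_copied = 0
    literal_added = 0
    for chunk, n_dst in dst_table.items():
        n_src = src_table.get(chunk, 0)
        copied = min(n_src, n_dst)  # occurrences D can take from S
        src_copied += copied * len(chunk)
        literal_added += (n_dst - copied) * len(chunk)
    # Chunks present only in S are deletions and are not counted at all.
    return src_copied, literal_added


original = [b"First line of code.\n"]
same_line_edit = [b"First line of code. abc\n"]
new_line_edit = [b"First line of code.\n", b"\n", b"abc\n"]

print(count_changes(original, same_line_edit))
print(count_changes(original, new_line_edit))
```

Appending abc to the first line changes the only chunk in the file, so nothing counts as copied. Putting abc on a new line leaves the original 20-byte chunk untouched, so all 20 bytes count as copied.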

These two values (src_copied and literal_added), computed in diffcore_count_changes, are handed over to the function estimate_similarity in diffcore-rename.c. That function completely ignores the literal_added count (other callers of diffcore_count_changes, such as the break detection in diffcore-break.c, do use it, but rename scoring does not). Instead, only the src_copied number matters:

score = (int)(src_copied * MAX_SCORE / max_size);

where max_size is the size in bytes of the larger of the two input files S and D.
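
To see what this formula does to the two edits from the question, here is a small worked sketch. It uses the MAX_SCORE of 60000 and the default threshold of 30000 quoted further down; the function name similarity_score, the zero-size guard, and the assumed 20-byte original file ("First line of code." plus a newline) are mine.

```python
MAX_SCORE = 60000.0
DEFAULT_MINIMUM_SCORE = 30000  # corresponds to -M50%


def similarity_score(src_copied, src_size, dst_size):
    """Scaled similarity: copied bytes relative to the larger file."""
    max_size = max(src_size, dst_size)
    if max_size == 0:
        return 0  # guard against division by zero in this sketch
    return int(src_copied * MAX_SCORE / max_size)


# abc appended to line 1: the only chunk changed, so 0 bytes are copied.
same_line = similarity_score(src_copied=0, src_size=20, dst_size=24)

# abc on line 3: the original 20-byte line survives as its own chunk.
new_line = similarity_score(src_copied=20, src_size=20, dst_size=25)

print(same_line, same_line >= DEFAULT_MINIMUM_SCORE)
print(new_line, new_line >= DEFAULT_MINIMUM_SCORE)
```

Under these assumptions the first edit scores 0 and the second scores 48000 (80%), which matches what git status showed: no rename in the first case, a rename in the second. The number of bytes added is irrelevant; what counts is how many bytes of the original survive as whole chunks.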

Note that there is an earlier computation:

max_size = ((src->size > dst->size) ? src->size : dst->size);
base_size = ((src->size < dst->size) ? src->size : dst->size);
delta_size = max_size - base_size;

and if the two files have changed size "too much":

if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
        return 0;

we never even get into the diffcore-delta.c code to hash them. The minimum_score here is the argument to -M or --find-renames, converted to a scaled number. MAX_SCORE is 60000.0 (type double), so the default minimum_score, when you use the default -M50%, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.
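
The size shortcut can be tried out in isolation. The sketch below mirrors the comparison quoted above; sizes_too_different and minimum_score_from_percent are hypothetical helper names, and the percentage conversion is a simplification rather than Git's actual option parsing.

```python
MAX_SCORE = 60000.0


def minimum_score_from_percent(percent):
    """Scale a -M percentage to the 0..MAX_SCORE range (simplified)."""
    return int(MAX_SCORE * percent / 100)


def sizes_too_different(src_size, dst_size, minimum_score):
    """True when the size gap alone rules out reaching minimum_score."""
    max_size = max(src_size, dst_size)
    base_size = min(src_size, dst_size)
    delta_size = max_size - base_size
    return max_size * (MAX_SCORE - minimum_score) < delta_size * MAX_SCORE


threshold = minimum_score_from_percent(50)

print(sizes_too_different(20, 25, threshold))  # small growth
print(sizes_too_different(10, 25, threshold))  # file more than doubled
```

With the default 50% threshold, a file that grows from 20 to 25 bytes still goes on to the chunk comparison, while one that grows from 10 to 25 bytes is rejected before any hashing, because at most 10 of its 25 bytes could have been copied.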

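Again, git status always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were, the code would go here, setting the rename_score field of the diff options. (This describes Git as of the time of writing; later releases added git status --find-renames[=<n>] and the status.renames and status.renameLimit configuration settings.)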
