Trying to understand `git diff` and `git mv` rename detection mechanism
Problem description
This is a follow-up to another question I asked before.
Before being edited, the initially created file `something` gets renamed to `somethingelse`, which can be observed here:
git mv something somethingelse
The file `somethingelse` then gets renamed back to `something` before the second vim edit:
git mv somethingelse something
Basically in the following portion of the code:
# If you add something to the first line, the rename will not be detected by Git
# However, if you instead create 2 newlines and fill line 3 with new code,
# the rename gets detected for whatever reason
printf "\nCOMMAND: vim something\n\n"
vim something
If at this point I add `abc` to the code, we would end up with:
First line of code. abc
I think this is an addition of 4 bytes on line 1, which in turn results in this:
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
new file: something
deleted: somethingelse
Then, if we add a newline and type `abc` into the third line (which should also be 4 bytes, correct me if I'm wrong):
First line of code.

abc
Suddenly, Git will detect the rename (edit inclusive):
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
renamed: somethingelse -> something
One good answer/comment by @torek explains this to a certain extent, taking the `git diff` rename detection threshold of `git status` into consideration.
Shouldn't Git behave identically, since we added 4 bytes in both cases, just in a different manner? Or does the newline have something to do with this?
Git's "similarity index" computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.
To compute the similarity index for two files S (source) and D (destination), Git:
- reads both files
- computes a hash table of all of the chunks of file S
- computes a second hash table of all of the chunks of file D
The entries in these two hash tables are simply a count of the occurrences of each hash value (plus, as noted below, the length of the chunk).
The hash value for a file chunk is computed by:
- start at the current file offset (initially zero)
- read 64 bytes or until a `'\n'` character, whichever occurs first
- if the file is claimed to be text and there is a `'\r'` before the `'\n'`, discard the `'\r'`
- hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file
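The chunking steps above can be sketched in Python roughly as follows. This is an illustration, not Git's code: the function name `split_chunks` is hypothetical, the sketch assumes the terminating `'\n'` stays part of its chunk, and it returns the raw chunk bytes instead of hashing them.

```python
def split_chunks(data: bytes, is_text: bool = True, max_len: int = 64) -> list[bytes]:
    """Split data into chunks that end at a newline or after max_len bytes.

    Hypothetical helper for illustration; Git hashes each chunk, whereas
    this sketch simply returns the chunk bytes.
    """
    chunks = []
    current = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        # For text files, drop a CR that is immediately followed by LF.
        if is_text and byte == 0x0D and i < len(data) and data[i] == 0x0A:
            continue
        current.append(byte)
        # A chunk ends at a newline or once it reaches max_len bytes.
        if byte == 0x0A or len(current) >= max_len:
            chunks.append(bytes(current))
            current = bytearray()
    if current:
        # Trailing bytes without a final newline form the last chunk.
        chunks.append(bytes(current))
    return chunks
```

With this splitting, a file containing `First line of code.`, an empty line, and `abc` yields three chunks, the first of which is byte-for-byte the original first line.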
Now that there are hash tables for both S and D, each possible hash h_i appears n_S times in S and n_D times in D (either may be zero, though the code skips right over hash values where both are zero). If the number of occurrences in D is less than or equal to the number in S (i.e., n_D ≤ n_S), then D "copies from S" n_D times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a "literal add" of n_D - n_S occurrences of the hashed chunk, and D also copies all n_S original occurrences.
Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of "chunks" to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)
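As a rough sketch of this counting, assuming the file contents from the question (an original file holding only `First line of code.` plus a newline, which is my assumption), the two byte counts could be computed like this. The helper names are hypothetical, the chunk bytes themselves stand in for Git's hash values (so hash collisions are not modelled), and CR handling is left out.

```python
from collections import Counter


def chunk_counts(data: bytes, max_len: int = 64) -> Counter:
    """Count chunks that end at a newline or after max_len bytes."""
    counts = Counter()
    start = 0
    while start < len(data):
        newline = data.find(b"\n", start, start + max_len)
        end = newline + 1 if newline != -1 else min(start + max_len, len(data))
        counts[data[start:end]] += 1
        start = end
    return counts


def count_changes(src: bytes, dst: bytes) -> tuple[int, int]:
    """Return (src_copied, literal_added) in bytes, per the rules above."""
    src_counts = chunk_counts(src)
    dst_counts = chunk_counts(dst)
    src_copied = 0
    literal_added = 0
    for chunk, n_dst in dst_counts.items():
        n_src = src_counts.get(chunk, 0)
        # D copies min(n_S, n_D) occurrences and adds any surplus literally.
        src_copied += min(n_src, n_dst) * len(chunk)
        literal_added += max(n_dst - n_src, 0) * len(chunk)
    return src_copied, literal_added


original = b"First line of code.\n"
same_line_edit = b"First line of code. abc\n"
third_line_edit = b"First line of code.\n\nabc\n"

print(count_changes(original, same_line_edit))   # (0, 24)
print(count_changes(original, third_line_edit))  # (20, 5)
```

Under these assumptions, appending to the first line changes the only chunk the two files could share, so nothing counts as copied, while adding new lines leaves the original chunk intact.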
These two values (`src_copied` and `literal_added`) computed in `diffcore_count_changes` are handed over to function `estimate_similarity` in `diffcore-rename.c`. It completely ignores the `literal_added` count (this count is used in deciding how to build packfile deltas, but not in terms of rename scoring). Instead, only the `src_copied` number matters:
score = (int)(src_copied * MAX_SCORE / max_size);
where `max_size` is the size in bytes of the larger of the two input files S and D.
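To make the formula concrete, here is a small Python sketch of the same arithmetic. The function name `similarity_score` and the zero-size guard are my own additions for illustration; the example numbers assume a 20-byte original file whose single line survives unchanged in a 25-byte destination.

```python
MAX_SCORE = 60000.0  # same scale as in the quoted C code


def similarity_score(src_copied: int, src_size: int, dst_size: int) -> int:
    """Scale the copied byte count by the larger of the two file sizes."""
    max_size = max(src_size, dst_size)
    if max_size == 0:
        # Guard added for this sketch only, to avoid dividing by zero.
        return 0
    return int(src_copied * MAX_SCORE / max_size)


# 20 of 25 bytes copied: 48000, above the default threshold of 30000.
print(similarity_score(20, 20, 25))
# Nothing copied: 0, so no rename is reported.
print(similarity_score(0, 20, 24))
```

A score is compared against `minimum_score` on the same 0 to 60000 scale, so 48000 corresponds to 80% similarity.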
Note that there is an earlier computation:
max_size = ((src->size > dst->size) ? src->size : dst->size);
base_size = ((src->size < dst->size) ? src->size : dst->size);
delta_size = max_size - base_size;
and if the two files have changed size "too much":
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
return 0;
we never even get into the `diffcore-delta.c` code to hash them. The `minimum_score` here is the argument to `-M` or `--find-renames`, converted to a scaled number. `MAX_SCORE` is `60000.0` (type `double`), so the default `minimum_score`, when you use the default `-M50%`, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.
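The size shortcut and the threshold scaling can be sketched in Python as follows. Both function names are hypothetical, and the percentage conversion is a simplification of however Git actually parses the `-M` argument; only the comparison itself mirrors the quoted C condition.

```python
MAX_SCORE = 60000.0  # same scale as in the quoted C code


def minimum_score_from_percent(percent: float) -> int:
    """Convert a percentage such as 50 into the 0..60000 scale (simplified)."""
    return int(MAX_SCORE * percent / 100)


def sizes_too_different(src_size: int, dst_size: int, minimum_score: int) -> bool:
    """Mirror the early-exit test: True means the pair is rejected unhashed."""
    max_size = max(src_size, dst_size)
    base_size = min(src_size, dst_size)
    delta_size = max_size - base_size
    return max_size * (MAX_SCORE - minimum_score) < delta_size * MAX_SCORE


threshold = minimum_score_from_percent(50)
print(threshold)                                    # 30000
print(sizes_too_different(100, 49, threshold))      # True: size changed by more than half
print(sizes_too_different(20, 25, threshold))       # False: the files get hashed and scored
```

At the default 50% threshold, the check rejects a pair only when the smaller file is less than half the size of the larger one.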
Again, `git status` always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were, the code would go here, setting the `rename_score` field of the diff options.