What's git's heuristic for assigning content modifications to file paths?

Question

Short version:

Short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?


Detailed version:

In the (Unix) shell demo interaction below, two files, a and b, are "git-commit'ted", then they are modified so as to (effectively) transfer most of a's content to b, and finally the two files are committed once more.

The key thing to look for is that the output of the second git commit ends with the line

rename a => b (99%)

even though no renaming of files (in the usual sense) ever took place (!?!).


Before showing the demo, this brief description may make it easier to follow.

The contents of the files a and b are generated by combining the contents of the three auxiliary files, ../A, ../B, and ../C. Symbolically, the states of a and b could be represented as

../A + ../C -> a
../B        -> b

right before the first commit, and

../A        -> a
../B + ../C -> b

right before the second one.

OK, here's the demo.


First, we display the contents of auxiliary files ../A, ../B, and ../C:

head ../A ../B ../C
# ==> ../A <==
# ...
# 
# ==> ../B <==
# ###
# 
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

(Lines beginning with # correspond to output to the terminal; the actual output lines do not have the leading #.)

Next, we create files a and b, display their contents, and commit them:

cat ../A ../C > a
cat ../B      > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# 
# ==> b <==
# ###

git add a b
git commit --allow-empty-message -m ''
# [master (root-commit) 3576df7] 
#  2 files changed, 8 insertions(+)
#  create mode 100644 a
#  create mode 100644 b

Next, we modify files a and b, and display their new contents:

cat ../A      > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
#
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

Finally, we commit the modified a and b; note the output of git commit:

git add a b
git commit --allow-empty-message -m ''
# [master 25b806f] 
#  2 files changed, 2 insertions(+), 8 deletions(-)
#  rewrite a (99%)
#  rename a => b (99%)


I rationalize this behavior as follows.

As I understand it, git treats directory structure info (such as the pathnames of the files it's tracking) as secondary information (or metadata, if you will), to be associated with the primary information it tracks, namely various chunks of content.

Since both the contents and the names (including pathnames) of files may change between commits, git must use heuristics to associate pathnames with chunks of content. But heuristics, by their very nature, are not guaranteed to work 100% of the time. A failure of such heuristics here takes the form of a history that does not faithfully represent what actually happened (e.g. it reports a file renaming even though no file was renamed, in the usual sense).

A further confirmation of this interpretation (namely, that some heuristics are at play) is that, AFAICT, if the size of the transferred chunk is not sufficiently large, the output of git commit will not include the rewrite/rename lines. (I include a demonstration of this case at the end of this post, FWIW.)

My question is this: short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?


This second demo is identical to the first one in every way, except that the auxiliary file ../C is one line shorter than before.

head ../A ../B ../C
# ==> ../A <==
# ...
# 
# ==> ../B <==
# ###
# 
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

cat ../A ../C > a
cat ../B      > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# 
# ==> b <==
# ###

git add .
git commit -a --allow-empty-message -m ''
# [master (root-commit) a06a689] 
#  2 files changed, 7 insertions(+)
#  create mode 100644 a
#  create mode 100644 b

cat ../A      > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
# 
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

git add .
git commit -a --allow-empty-message -m ''
# [master 87415a1] 
#  2 files changed, 5 insertions(+), 5 deletions(-)

Answer

As you noticed, Git performs rename detection using a heuristic, rather than being told that a rename occurred. The git mv command, in fact, simply stages an addition at the new file path and a removal at the old file path. Thus, rename detection is performed by comparing the contents of added files to the previously committed contents of deleted files.

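First, candidates are collected. Any new files are possible rename targets and any deleted files are possible rename sources. In addition, rewriting changes are broken up, such that a file that is more than 50% different from its previous revision is both a possible rename source and a possible rename target. This is why the first demo reports both "rewrite a" and "rename a => b". Git's source also defines a minimum size below which a file pair is never broken (MINIMUM_BREAK_SIZE, 400 bytes as far as I can tell), which may be why the smaller files in the second demo produce no rewrite/rename lines.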
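The candidate collection described above can be sketched as follows. This is only an illustration of the idea, not git's code: collect_candidates, its parameters, and the pluggable dissimilarity function are all hypothetical names, and git's real break logic is more involved than a single threshold.

```python
from typing import Callable, Dict, Set, Tuple


def collect_candidates(
    old: Dict[str, bytes],
    new: Dict[str, bytes],
    dissimilarity: Callable[[bytes, bytes], float],
    break_threshold: float = 0.5,  # assumption: the "more than 50% different" rule
) -> Tuple[Set[str], Set[str]]:
    """Return (rename sources, rename targets) for two path -> content snapshots."""
    # Deleted paths are possible sources; added paths are possible targets.
    sources = {path for path in old if path not in new}
    targets = {path for path in new if path not in old}
    # A heavily rewritten path is "broken": it counts as both source and target.
    for path in old.keys() & new.keys():
        if old[path] != new[path] and dissimilarity(old[path], new[path]) > break_threshold:
            sources.add(path)
            targets.add(path)
    return sources, targets
```

The dissimilarity measure is deliberately left as a parameter, since the exact measure git uses for breaking is not spelled out here.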

Next, identical renames are detected. If you rename a file without making any changes, then the file will hash identically. These renames can be detected just by comparing the hashes recorded in the index, without reading the file contents, so removing them from the candidate list reduces the number of content comparisons that need to be performed.
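To make the exact-match step concrete, here is a minimal sketch. The blob ID computation (SHA-1 over a "blob <size>\0" header plus the content) is how git names blobs in SHA-1 repositories; the exact_renames helper is a hypothetical name, and unlike git it recomputes the IDs from the content only so that the example is self-contained.

```python
import hashlib
from typing import Dict, List, Tuple


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def exact_renames(
    deleted: Dict[str, bytes], added: Dict[str, bytes]
) -> List[Tuple[str, str]]:
    """Pair each added path with a deleted path that has the same blob ID.

    Hypothetical helper: git compares IDs it has already stored, rather than
    rehashing file contents as this sketch does.
    """
    source_by_id: Dict[str, str] = {}
    for path, content in deleted.items():
        # Keep the first deleted path seen for each blob ID.
        source_by_id.setdefault(git_blob_id(content), path)
    pairs = []
    for path, content in added.items():
        source = source_by_id.get(git_blob_id(content))
        if source is not None:
            pairs.append((source, path))
    return pairs
```

Any pair found this way can be dropped from the candidate lists before the more expensive similarity comparison runs.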

Finally, the similarity comparison is performed. Each line in each candidate file is hashed and collected in a sorted list. Long lines are split into chunks (git's own implementation splits after 64 bytes). Whitespace-only lines may be stripped, on the assumption that they don't contribute greatly to the similarity matching. The line hashes from each candidate source are compared to the line hashes from each candidate target. If the two lists are similar enough, the pair is deemed a rename. In git the default threshold is 50% similarity, and it can be changed with the -M (--find-renames) option.
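A rough sketch of that comparison, under stated assumptions: chunks are cut at each newline or after 64 bytes, the chunks themselves are counted instead of their hashes, and the score is the number of shared bytes divided by the size of the larger file. The names chunks, similarity, and CHUNK_LIMIT are mine, and whitespace stripping is left out.

```python
from collections import Counter
from typing import Iterator

CHUNK_LIMIT = 64  # assumption: maximum chunk length in bytes


def chunks(data: bytes) -> Iterator[bytes]:
    """Split data at each newline or after CHUNK_LIMIT bytes, whichever comes first."""
    start = 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == CHUNK_LIMIT:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing chunk without a final newline


def similarity(src: bytes, dst: bytes) -> float:
    """Fraction of bytes, relative to the larger file, found in both files."""
    if not src and not dst:
        return 1.0
    src_counts = Counter(chunks(src))
    dst_counts = Counter(chunks(dst))
    # A chunk occurring n times in src and m times in dst is shared min(n, m) times.
    common = sum(
        len(chunk) * min(count, dst_counts[chunk])
        for chunk, count in src_counts.items()
    )
    return common / max(len(src), len(dst))
```

If the "=" lines in the first demo are 65 characters long, as they appear to be, then the old a and the new b are 400 bytes each and share 396 of them, so this sketch gives 396/400 = 0.99, in line with the 99% that git reported.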
