How does git match blobs to files across commit trees?


Question


Chapter 3.1 of the Git book clearly states that only staged files get to be stored as blobs in the commit tree.

If, like a commit object, a blob gets a hash ID that is unique to its content, how does Git manage to keep track of the correspondence between blobs and files across commits? The hash IDs of the same file's blobs in different commits cannot match, since their contents differ.


A simple example:

Let's suppose I just created an empty repo with no commits. I create a file README.md, stage it and commit it. Git stores a tree object that has a blob identified by the hash of the contents of README.md.

Let's suppose I modify README.md, stage and commit. Git stores a tree object that has a blob identified by a hash of the modified contents of README.md. Naturally, we can expect this second hash to be different from the hash identifying the blob of README.md in the first commit tree.

How would Git answer a request about README.md history?

git log README.md

My hunch is that it walks over the commit history and compares the relevant blobs, but I don't see how Git can know that the blobs correspond to different versions of the same file, except in trivial cases.


Solution

That's actually quite a good question.

The internal storage form of a commit is partly relevant, so let's consider it for a moment. An individual commit is actually pretty small. Here is one from the Git repository for Git, namely commit b5101f929789889c2e536d915698f58d5c5c6b7a:

$ git cat-file -p b5101f929789889c2e536d915698f58d5c5c6b7a | sed 's/@/ /'
tree 3f109f9d1abd310a06dc7409176a4380f16aa5f2
parent a562a119833b7202d5c9b9069d1abb40c1f9b59a
author Junio C Hamano <gitster pobox.com> 1548795295 -0800
committer Junio C Hamano <gitster pobox.com> 1548795295 -0800

Fourth batch after 2.20

Signed-off-by: Junio C Hamano <gitster pobox.com>

(the sed 's/@/ /' is just to maybe, possibly, cut down on the amount of email spam that Junio Hamano must get :-) ). As you can see here, the commit object refers to its parent commit object by the other commit's hash ID, a562a11983.... It also refers to a tree object by hash ID, and the tree object's hash ID begins with 3f109f9d1a. We can look at this tree object using git cat-file -p too:

$ git cat-file -p 3f109f9d1a | head
100644 blob de1c8b5c77f7566d9e41949e5e397db3cc1b487c    .clang-format
100644 blob 42cdc4bbfb05934bb9c3ed2fe0e0d45212c32d7a    .editorconfig
100644 blob 9fa72ad4503031528e24e7c69f24ca92bcc99914    .gitattributes
040000 tree 7ba15927519648dbc42b15e61739cbf5aeebf48b    .github
100644 blob 0d77ea5894274c43c4b348c8b52b8e665a1a339e    .gitignore
100644 blob cbeebdab7a5e2c6afec338c3534930f569c90f63    .gitmodules
100644 blob 247a3deb7e1418f0fdcfd9719cb7f609775d2804    .mailmap
100644 blob 03c8e4c613015476fffe3f1e071c0c9d6609df0e    .travis.yml
100644 blob 8c85014a0a936892f6832c68e3db646b6f9d2ea2    .tsan-suppressions
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING

(the tree has quite a lot of data so I've copied only the first ten lines here).

Inside the tree, you see the mode (100644), type (blob—this is implied by the mode and is also recorded in the internal Git object; it's not actually stored in the tree object), hash ID (de1c8b5c77f...), and name (.clang-format) of a blob. You can also see that the tree can refer to additional tree objects, as is the case for the .github sub-tree.

If we take this particular blob object hash ID, we can view that object's contents by hash ID too:

$ git cat-file -p de1c8b5c77f | head
# This file is an example configuration for clang-format 5.0.
#
# Note that this style definition should only be understood as a hint
# for writing new code. The rules are still work-in-progress and does
# not yet exactly match the style we have in the existing code.

# Use tabs whenever we need to fill whitespace that spans at least from one tab
# stop to the next one.
#
# These settings are mirrored in .editorconfig.  Keep them in sync.

(again I've cut off the copy at 10 lines as the file is quite long).
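As an aside, here is a minimal Python sketch of where a blob's hash ID comes from. It assumes a SHA-1 repository (newer Git can also use SHA-256) and the helper name git_blob_id is mine, not Git's. The point is that the ID depends only on the content, never on the file name:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two versions of the same file: different content, so different blob IDs.
first = git_blob_id(b"hello world\n")
second = git_blob_id(b"hello world, again\n")
print(first)
print(second)
print(first == second)
```

Note that no file name goes into the hash, so two files with identical content share one blob, which matters for rename detection later on.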

Just for illustration let's look at the .github sub-tree too:

$ git cat-file -p 7ba15927519648dbc42b15e61739cbf5aeebf48b
100644 blob 64e605a02b71c51e9f59c429b28961c3152039b9    CONTRIBUTING.md
100644 blob adba13e5baf4603de72341068532e2c7d7d05f75    PULL_REQUEST_TEMPLATE.md

What Git does with these, then, is to read—recursively as needed—the tree object from a commit. Git will read these into a data structure it calls an index or cache. (The in-memory version of this is, technically speaking, the cache data structure, although Git documentation tends to be a bit loose about which names to use when.) So the cache built by reading commit b5101f929789889c2e536d915698f58d5c5c6b7a will say, for instance, that name .clang-format has mode 100644 and blob-hash de1c8b5c77f7566d9e41949e5e397db3cc1b487c, while name .github/CONTRIBUTING.md has mode 100644 and blob-hash 64e605a02b71c51e9f59c429b28961c3152039b9.

Note that the various name components (.github plus CONTRIBUTING.md) have, in effect, been joined-up in the in-memory cache. (In the on-disk format they're compressed via algorithmic trickery.)
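To make the "joined-up names" idea concrete, here is a rough Python model, not Git's actual code. The nested dictionaries stand in for tree objects (real Git stores the sub-tree's hash ID and looks the object up), and the function name flatten is hypothetical; the hash IDs are the ones shown above:

```python
# Toy model of tree objects: each tree maps a name to (mode, type, target).
# For a blob the target is its hash ID; for a tree it is the nested tree itself.
GITHUB_TREE = {
    "CONTRIBUTING.md": ("100644", "blob", "64e605a02b71c51e9f59c429b28961c3152039b9"),
    "PULL_REQUEST_TEMPLATE.md": ("100644", "blob", "adba13e5baf4603de72341068532e2c7d7d05f75"),
}
ROOT_TREE = {
    ".clang-format": ("100644", "blob", "de1c8b5c77f7566d9e41949e5e397db3cc1b487c"),
    ".github": ("040000", "tree", GITHUB_TREE),
}

def flatten(tree, prefix=""):
    """Recursively turn nested trees into {full_path: (mode, blob_hash)}."""
    entries = {}
    for name, (mode, kind, target) in tree.items():
        path = prefix + name
        if kind == "tree":
            # Descend into the sub-tree, joining name components with "/".
            entries.update(flatten(target, path + "/"))
        else:
            entries[path] = (mode, target)
    return entries

cache = flatten(ROOT_TREE)
for path in sorted(cache):
    print(path, *cache[path])
```

The result has one entry per file, keyed by its full path, and no entry for the .github directory itself.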

The in-memory cache that helps Git match up file names

In the end, then, it's the internal (in-memory) cache that holds the <file-name, file-mode, blob-hash> tuples. If you ask Git to compare commit b5101f929789889c2e536d915698f58d5c5c6b7a to some other commit, Git reads the other commit into an in-memory cache as well. That other cache either has an entry named .github/CONTRIBUTING.md, or it doesn't.

If both commits have files that have the same names, Git assumes, for the purpose of this one comparison (but see below), that these are the same file. That's true whether the blob hashes are the same or not.

The real question we're answering here has to do with identity. The identity of a file, in a version control system, determines whether that file is "the same" file in two different versions (however the version control system itself defines versions). This relates to the fundamental philosophical question of identity, as outlined in this Wikipedia article on the thought experiment about the Ship of Theseus: how do we know that something, or even someone, is who or what we think they are? If you met your cousin Bob when you and he were both very young, and you meet someone again who is named Bob, is he your cousin? You and he were tiny then; now you're larger and older, with different experiences. In the real world we seek cues from our environment: is Bob the child of people who are siblings of your parents? If so, that Bob probably is the same cousin Bob you met long ago, even if he (and you) look very different now.

Git, of course, doesn't do any of this. In most cases the simple fact that both files are named .github/CONTRIBUTING.md suffices to identify them as "the same file". The names are the same, so we're done.
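Name-based matching can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each commit is reduced to a {path: blob_hash} dictionary, the short hashes are made up, and compare_by_name is a hypothetical name:

```python
def compare_by_name(left, right):
    """Classify every path, treating equal names as the same file."""
    status = {}
    for path in sorted(set(left) | set(right)):
        if path not in right:
            status[path] = "deleted"
        elif path not in left:
            status[path] = "added"
        elif left[path] == right[path]:
            status[path] = "unchanged"   # same name, same blob hash
        else:
            status[path] = "modified"    # same name, different blob hash
    return status

# Made-up blob hashes, shortened for readability.
left = {"README.md": "aaa111", ".gitignore": "bbb222", "old.txt": "ccc333"}
right = {"README.md": "ddd444", ".gitignore": "bbb222", "new.txt": "eee555"}

status = compare_by_name(left, right)
for path, state in status.items():
    print(state, path)
```

This is roughly why git log README.md works for the example in the question: the two blobs have different hashes but the same path, so they count as two versions of one file.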

git diff offers extra services

In our everyday development, we sometimes have occasion to rename a file. A file named a/b.c might be renamed to d/e.f or d/e.c for some reason.

Suppose we're on commit a123456 and the file is named a/b.c. Then we move to commit f789abc. That second commit has no a/b.c but does have a d/e.f. Git will simply remove a/b.c from our index (the on-disk form of the cache) and work-tree, and populate a new d/e.f into our index and work-tree, and all is well.

But suppose we ask Git to compare a123456 with f789abc. Git could just tell us: To change a123456 to f789abc, remove a/b.c and create a new d/e.f with these contents. That is what git checkout did and it suffices. But what if the contents exactly match? It's much more efficient for Git to tell us: To change a123456 to f789abc, rename a/b.c to d/e.f. And in fact, with the right options, git diff will do just that:

git diff --find-renames a123456 f789abc

How did Git manage this trick? The answer lies in computing file identity.

Finding file identity

Suppose that commit L (for left-side) has some file (a/b.c) that isn't in commit R (for right-side). Suppose further that commit R has some file (d/e.f) that isn't in commit L. Instead of immediately just telling us: you should remove the L file and use the R file, Git can now compare the contents of the two files.

Because of the nature of Git object hashes—they are completely deterministic, based on file contents—it's really easy for Git to detect that a/b.c in L is 100% identical to d/e.f in R. In this particular case, they will have exactly the same hash ID! So Git does that: if there's some file that's vanished from L and some other file that has appeared in R, and Git has been asked to find renames, Git checks for hash-ID matches. If it finds some, it pairs up those files (and takes them out of the queue of unmatched files—this queue, holding files from L and R, is the "rename detection queue").

Those files with differing names have been identified as the same file. Little cousin Bob is the same as big cousin Bob after all—except in this case, both of you still need to be little.
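The exact-match step could look something like the following Python sketch. The queue is modelled as two {path: blob_hash} dictionaries (files only in L, files only in R), the hashes are made up, and pair_exact_renames is a hypothetical helper rather than anything in Git's source:

```python
def pair_exact_renames(deleted, added):
    """Pair deleted and added paths whose blob hashes are identical.

    deleted, added: {path: blob_hash} for files only in L / only in R.
    Returns (renames, remaining_deleted, remaining_added).
    """
    # Index the added files by hash so each lookup is a dictionary hit.
    added_by_hash = {}
    for path in sorted(added):
        added_by_hash.setdefault(added[path], []).append(path)

    renames = {}
    for old_path in sorted(deleted):
        candidates = added_by_hash.get(deleted[old_path])
        if candidates:
            renames[old_path] = candidates.pop(0)

    # Whatever was not paired stays in the rename detection queue.
    paired_new = set(renames.values())
    remaining_deleted = {p: h for p, h in deleted.items() if p not in renames}
    remaining_added = {p: h for p, h in added.items() if p not in paired_new}
    return renames, remaining_deleted, remaining_added

renames, left_over_l, left_over_r = pair_exact_renames(
    {"a/b.c": "1111", "old.txt": "2222"},
    {"d/e.f": "1111", "new.txt": "3333"},
)
print(renames)
print(left_over_l, left_over_r)
```

No file content is read at all in this step; comparing hash IDs is enough.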

So, if this rename-detection hasn't yet paired a file in L with one in R, Git will try harder. Now it will extract the actual blobs, and compute a sort of "percentage of match". This uses a complicated little algorithm I won't describe here, but if enough sub-strings within the two files match, Git will declare the files to be 50, 60, 75, or more percent similar.

Having found one pair of files in the rename queue that are, say, 72% similar to each other, Git goes on to compare the files to all the other files as well. If it finds that one of those two is 94% similar to another, that similarity-pairing beats the 72% similarity-pairing. If not, the 72% similarity is sufficient—it's at least 50%—so Git will pair up those two files and declare that they have the same identity.

In any case, if the match is good enough and is the best one among all the unpaired files, that particular match is taken. Once again, little cousin Bob is the same as big cousin Bob after all.

After running this test on all unmatched file pairs, git diff takes the matched-up results and calls those files renamed. Again, this only happens if you use --find-renames (or -M), and you can set the threshold to something other than 50% if you like.
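Here is a hedged Python sketch of the "best match above a threshold" idea. Git's real similarity computation is its own algorithm; difflib.SequenceMatcher from the Python standard library is swapped in purely as a stand-in score, and pair_similar is a hypothetical name. Only the selection logic (score every pair, take the best ones at or above 50%) mirrors the description above:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; Git uses its own algorithm."""
    return SequenceMatcher(None, a, b).ratio()

def pair_similar(deleted, added, threshold=0.5):
    """Greedily pair files by best similarity at or above the threshold.

    deleted, added: {path: file_content} for the still-unmatched files.
    Returns {old_path: (new_path, score)}.
    """
    scored = sorted(
        ((similarity(old, new), old_path, new_path)
         for old_path, old in deleted.items()
         for new_path, new in added.items()),
        reverse=True,  # best-scoring candidate pairs first
    )
    renames, used_old, used_new = {}, set(), set()
    for score, old_path, new_path in scored:
        if score < threshold:
            break  # every remaining pair scores lower still
        if old_path in used_old or new_path in used_new:
            continue  # one side was already claimed by a better match
        renames[old_path] = (new_path, score)
        used_old.add(old_path)
        used_new.add(new_path)
    return renames

deleted = {"a/b.c": "int main(void)\n{\n    return 0;\n}\n"}
added = {
    "d/e.f": "int main(void)\n{\n    return 1;\n}\n",
    "notes.txt": "completely unrelated text\n",
}
renames = pair_similar(deleted, added)
print(renames)
```

Raising the threshold argument plays the role of passing a stricter percentage to --find-renames.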

Breaking incorrect matches

The git diff command offers another service. Note that we started out by assuming that if commits L and R had files with the same name, those files were the same file, even if the contents differ. But what if they're not? What if the file named file in L was renamed to bettername in R, and someone then created a new, unrelated file named file in R?

To handle this, git diff offers the -B (or "break pairing") option. With -B in effect, files that started out identified by name will have their pairing broken if they are too dissimilar. That is, Git will check whether the two blob hashes match, and if not, Git will compute a similarity index. If the index falls below some threshold, Git will break the pairing and put both files into the rename detection queue, before running the --find-renames style rename detector.

As a special twist, Git will re-pair broken pairings unless they are so extremely dissimilar that you don't want that to be done. Hence for -B you actually specify two similarity thresholds: the first number is when to tentatively break the pairing, and the second is when to permanently break it.
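A simplified Python sketch of the first half of this (tentatively breaking a pairing) follows. It models only the description given here, not Git's actual scoring: difflib.SequenceMatcher is a stand-in similarity measure, the 0.5 default is an arbitrary illustration value, break_pairings is a hypothetical name, and the second threshold (re-pairing) is left out entirely:

```python
from difflib import SequenceMatcher

def break_pairings(left, right, break_below=0.5):
    """Split same-name pairs whose contents are too dissimilar.

    left, right: {path: file_content}. Returns (kept, deleted, added):
    'kept' paths stay paired by name; the other two dictionaries are
    what gets fed into the rename detection queue.
    """
    kept, deleted, added = [], {}, {}
    for path in sorted(set(left) & set(right)):
        score = SequenceMatcher(None, left[path], right[path]).ratio()
        if score < break_below:
            # Treat as "old file deleted" plus "new file created".
            deleted[path] = left[path]
            added[path] = right[path]
        else:
            kept.append(path)
    return kept, deleted, added

left = {"file": "alpha\nbeta\ngamma\n", "README": "hello\n"}
right = {
    "file": "12345 67890\n",              # unrelated new content
    "README": "hello there\n",            # ordinary edit
    "bettername": "alpha\nbeta\ngamma\n",  # the old "file", renamed
}
kept, deleted, added = break_pairings(left, right)
print(kept, sorted(deleted), sorted(added))
```

Once the pairing for file is broken, the rename detector is free to pair the old file with bettername.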

git merge uses git diff --find-renames

When you use git merge to perform a three-way merge, there are three inputs:

  • a merge base commit, which is an ancestor of both tip commits; and
  • a left and right commit, --ours and --theirs.

Git runs two git diff commands internally. One compares the base to L and the other compares the base to R.

Both of these diffs run with --find-renames enabled. If the diff from base to L finds a rename, Git knows to use the changes shown across that rename. Likewise, if the diff from base to R finds a rename, Git knows to use those changes. It will combine both sets of changes—and attempt (but usually fail) to combine both renames, if both diffs show a rename.
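The way the two rename results feed into a merge can be sketched roughly like this in Python. It is a toy model under strong assumptions: each diff is reduced to a {base_path: new_path} dictionary of detected renames, merged_path is a hypothetical helper, and a conflict is reported simply as None:

```python
def merged_path(base_path, renames_l, renames_r):
    """Pick the name a base file should have in the merge result.

    renames_l, renames_r: {base_path: new_path} as detected by the
    base-to-L and base-to-R diffs. Returns None on a rename/rename conflict.
    """
    new_l = renames_l.get(base_path, base_path)
    new_r = renames_r.get(base_path, base_path)
    if new_l == new_r:
        return new_l       # neither side renamed, or both agree
    if new_l == base_path:
        return new_r       # only the R side renamed the file
    if new_r == base_path:
        return new_l       # only the L side renamed the file
    return None            # both sides renamed it differently

only_left = merged_path("a/b.c", {"a/b.c": "d/e.f"}, {})
both_differ = merged_path("a/b.c", {"a/b.c": "d/e.f"}, {"a/b.c": "x/y.z"})
print(only_left)
print(both_differ)
```

In the one-sided case, the content changes from the other side are then applied to the file under its new name.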

git log --follow also uses the rename detector

When using git log --follow, Git walks the commit history, one commit-pair—child-and-parent—at a time, doing diffs from parent to child. It turns on a limited form of the rename detection code to see if the one file you're --follow-ing was renamed in that commit pair. If so, as soon as git log moves to the parent, it changes which name it looks for. This technique works fairly well, but has some issues at merges (because merge commits have more than one parent).
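The name-switching behaviour can be illustrated with a small Python sketch. It assumes a strictly linear history (no merges) where the per-commit diff against the parent has already been computed; the commit IDs, the data layout, and the function name follow are all made up for the example:

```python
def follow(history, path):
    """Walk newest-to-oldest, tracking one file across renames.

    history: list of (commit_id, changes), newest first. 'changes' maps a
    path in that commit to (kind, old_path), where kind is "modified",
    "added", or "renamed" relative to the commit's single parent.
    Returns the (commit_id, path_at_that_commit) pairs that would be shown.
    """
    shown = []
    for commit_id, changes in history:
        if path not in changes:
            continue  # this commit did not touch the file we are following
        kind, old_path = changes[path]
        shown.append((commit_id, path))
        if kind == "renamed":
            path = old_path  # from here back, look for the older name
    return shown

history = [
    ("c4", {"other.txt": ("modified", None)}),
    ("c3", {"d/e.f": ("modified", None)}),
    ("c2", {"d/e.f": ("renamed", "a/b.c")}),
    ("c1", {"a/b.c": ("added", None)}),
]
result = follow(history, "d/e.f")
print(result)
```

Without the rename step, the walk would stop finding the file at c2, which is what plain git log with a path does.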

Conclusion

File identity is what this is all about. Since Git doesn't know, a priori, that file a/b.c in commit L is or is not "the same" file as file d/e.f in commit R, Git can use rename detection to decide. In some cases—such as checking out commit L or R—this does not matter one bit. In some cases, such as diffing the two commits, it matters, but only for us as humans trying to understand what happened. But in a few cases, such as merging, it's very important.
