How does git handle moving files in the file system?

Question

If I move files within a repository, such as from one folder to another, will git be smart enough to know that these are the same files and merely update its references to them in the repository, or does the new commit actually create copies of these files?

I ask because I wonder how useful git is for storing binary files. If it treats moved files as copies, then a repo could easily get very large even though you didn't actually add any new files.

Answer

To understand how git handles these, you need to know two things to start with:

  • Every individual file (in any directory, in any commit) is always stored separately.
  • But it is stored under its object-ID, which is unique to the data in the file.
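
To make the second point concrete, here is a minimal Python sketch of how a file's object-ID can be derived, assuming the default SHA-1 object format. The helper name `git_blob_id` is made up for this illustration; the point is only that the ID depends on the bytes of the file and not on its path.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object-ID for a blob holding these bytes.

    Hypothetical helper for illustration: the hash covers a short
    header ("blob <size>" plus a NUL byte) followed by the raw content.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The same bytes give the same ID, whatever the file is called.
same_a = git_blob_id(b"contents\n")
same_b = git_blob_id(b"contents\n")
# Changing a single letter gives a different ID.
different = git_blob_id(b"Contents\n")
print(same_a == same_b, same_a == different)
```

If git is installed, `git hash-object <file>` should report the same kind of ID for a file on disk.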

Let's say you have a new repo with one huge file in it:

$ mkdir temp; cd temp; git init
$ echo contents > bigfile; git add bigfile; git commit -m initial
[master (root-commit) d26649e] initial
 1 file changed, 1 insertion(+)
 create mode 100644 bigfile

The repo now has one commit, which has one tree (the top-level directory), which has one file, which has some unique object-ID. (The "big" file is a lie: it's quite small, but it would work the same if it were many megabytes.)

Now if you copy the file to a second version and commit that:

$ cp bigfile bigcopy; git add bigcopy; git commit -m 'make a copy'
[master 971847d] make a copy
 1 file changed, 1 insertion(+)
 create mode 100644 bigcopy

The repository now has two commits (obviously), with two trees (one for each version of the top-level directory), and one file object. The unique object-ID is the same for both copies. To see this, let's view the latest tree:

$ git cat-file -p HEAD:
100644 blob 12f00e90b6ef79117ce6e650416b8cf517099b78    bigcopy
100644 blob 12f00e90b6ef79117ce6e650416b8cf517099b78    bigfile

That big SHA-1 12f00e9... is the unique ID for the file contents. If the file really were enormous, git would now be using about half as much repo space as the working directory, because the repo has only one copy of the file (under the name 12f00e9...), while the working directory has two.
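
The "half as much space" arithmetic can be sketched with a toy in-memory store. `ToyObjectStore` is a hypothetical stand-in written for this example, not git's implementation, and it ignores compression; it only shows that adding the same bytes under two names leaves one stored object.

```python
import hashlib


class ToyObjectStore:
    """Hypothetical content-addressed store, for illustration only."""

    def __init__(self):
        self.objects = {}  # object-ID -> raw bytes

    def put(self, data: bytes) -> str:
        header = b"blob " + str(len(data)).encode("ascii") + b"\0"
        oid = hashlib.sha1(header + data).hexdigest()
        # Storing identical bytes a second time changes nothing.
        self.objects[oid] = data
        return oid


store = ToyObjectStore()
payload = b"x" * 1_000_000  # stand-in for a "big" file

# A "tree" maps names to object-IDs; both names point at one object.
tree = {
    "bigfile": store.put(payload),
    "bigcopy": store.put(payload),
}

stored_bytes = sum(len(blob) for blob in store.objects.values())
working_bytes = len(payload) * len(tree)
print(len(store.objects), stored_bytes, working_bytes)
```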

If you change the file contents, though (even one single bit, like making a lowercase letter uppercase), then the new contents will have a new SHA-1 object-ID and need a new copy in the repo. We'll get to that in a bit.

Now, suppose you have a more complicated directory structure (a repo with more "tree" objects). If you shuffle files around, but the contents of the "new" file(s), under whatever name(s), in new directories are the same as the contents that used to be in old ones, here's what happens internally:

$ mkdir A B; mv bigfile A; mv bigcopy B; git add -A .
$ git commit -m 'move stuff'
[master 82a64fe] move stuff
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename bigfile => A/bigfile (100%)
 rename bigcopy => B/bigcopy (100%)

Git has detected the (effective) rename. Let's look at one of the new trees:

$ git cat-file -p HEAD:A
100644 blob 12f00e90b6ef79117ce6e650416b8cf517099b78    bigfile

The file is still under the same old object-ID, so it's still only in the repo once. It's easy for git to detect the rename, because the object-ID matches, even though the path name (as stored in these "tree" objects) might not. Let's do one last thing:
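
As a rough sketch of why matching object-IDs make exact renames easy to spot, the Python below compares two path-to-ID mappings. The function name, the dictionary representation of a tree, and the "prefer the same base name" tie-break are assumptions made for this example; git's own matching is more involved.

```python
import posixpath


def detect_exact_renames(old_tree, new_tree):
    """Pair removed and added paths that carry the same object-ID.

    Simplified illustration: trees are plain {path: object_id} dicts.
    """
    deleted = sorted(p for p in old_tree if p not in new_tree)
    added = sorted(p for p in new_tree if p not in old_tree)
    renames = []
    for old_path in deleted:
        candidates = [p for p in added if new_tree[p] == old_tree[old_path]]
        if not candidates:
            continue  # a plain deletion, as far as this sketch can tell
        # When several added paths match, prefer one with the same file name.
        same_name = [
            p for p in candidates
            if posixpath.basename(p) == posixpath.basename(old_path)
        ]
        new_path = (same_name or candidates)[0]
        added.remove(new_path)
        renames.append((old_path, new_path))
    return renames


BLOB = "12f00e90b6ef79117ce6e650416b8cf517099b78"
before = {"bigfile": BLOB, "bigcopy": BLOB}
after = {"A/bigfile": BLOB, "B/bigcopy": BLOB}
print(detect_exact_renames(before, after))
```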

$ mv B/bigcopy B/two; git add -A .; git commit -m 'rename again'
[master 78d92d0] rename again
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename B/{bigcopy => two} (100%)

Now let's ask for a diff between HEAD~2 (before any renamings) and HEAD (after renaming):

$ git diff HEAD~2 HEAD
diff --git a/bigfile b/A/bigfile
similarity index 100%
rename from bigfile
rename to A/bigfile
diff --git a/bigcopy b/B/two
similarity index 100%
rename from bigcopy
rename to B/two

Even though it was done in two steps, git can tell that to go from what was in HEAD~2 to what is now in HEAD, you can rename bigcopy to B/two in one step.

Git always detects renames dynamically, when a diff is computed; nothing about the rename is recorded in the commit. Suppose that instead of doing renames, we'd removed the files entirely at some point, and committed that. Later, suppose we put the same data back (so that we got the same underlying object-IDs), and then diffed a sufficiently old version against the new one. Here git would say that to go directly from the old version to the newest, you could just rename the files, even if that's not how we got there along the way.

In other words, the diff is always done commit-pair-wise: "At some time in the past, we had A. Now we have Z. How do I go directly from A to Z?" At that time, git checks for the possibility of renames, and produces them in the diff output as needed.

Git will still (sometimes) show renames even if there's some small change to a file's contents. In this case, you get a "similarity index". Basically, you can tell git that given "some file deleted in rev A, some differently-named file added in rev Z" (when diffing revs A and Z), it should try diffing the two files to see if they're "close enough". If they are, you'll get a "file renamed and then changed" diff. The control for this is the -M or --find-renames argument to git diff: git diff -M80 says to show the change as rename-and-edit if the files are at least "80% similar".
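
The thresholding idea can be sketched in Python. This is not git's similarity algorithm: it swaps in `difflib.SequenceMatcher` over lines as a stand-in score, and the function names and the 80 default are assumptions chosen to mirror the -M80 example.

```python
from difflib import SequenceMatcher


def similarity_percent(old_lines, new_lines) -> int:
    """Stand-in similarity score (0-100) based on matching lines."""
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return round(matcher.ratio() * 100)


def classify(old_lines, new_lines, threshold=80) -> str:
    """Treat a delete/add pair as a rename if it is similar enough."""
    score = similarity_percent(old_lines, new_lines)
    return "rename+edit" if score >= threshold else "delete+add"


old = [f"line {n}" for n in range(10)]
edited = list(old)
edited[3] = "line three"                      # one line out of ten changed
unrelated = [f"other {n}" for n in range(10)]

print(similarity_percent(old, edited), classify(old, edited))
print(similarity_percent(old, unrelated), classify(old, unrelated))
```

Raising the threshold makes the sketch stricter about what it calls a rename, which is the same trade-off the -M value controls.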

Git will also look for "copied then changed", with the -C or --find-copies flag. (You can add --find-copies-harder to do a more computationally expensive search against all files; see the documentation.)

This relates (indirectly) to how git keeps repositories from blowing up in size over time, as well.

If you have a large file (or even a small file) and make a small change in it, git will store two complete copies of the file, each under its own object-ID. You find these things in .git/objects; for instance, that file whose ID is 12f00e90b6ef79117ce6e650416b8cf517099b78 is in .git/objects/12/f00e90b6ef79117ce6e650416b8cf517099b78. They're compressed to save space, but even compressed, a big file can still be pretty big. So, if the underlying object is not very active and appears in a lot of commits with only a few small changes every now and then, git has a way to compress the modifications even further. It puts them into "pack" files.
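
A small Python sketch of the loose-object layout described above: the first two hex digits of the ID name a directory and the rest name the file, and the stored bytes are zlib-compressed. The helper names are invented for this example, and it builds paths and bytes in memory only, without touching a real repository.

```python
import zlib


def loose_object_path(object_id: str) -> str:
    """Map an object-ID to its loose-object path under .git/objects."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def encode_loose_blob(data: bytes) -> bytes:
    """Compress a blob the way a loose object is stored (header + content)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return zlib.compress(header + data)


print(loose_object_path("12f00e90b6ef79117ce6e650416b8cf517099b78"))

# Repetitive content shrinks a lot, but each version is still a full copy.
big = b"the same line over and over\n" * 10_000
print(len(big), len(encode_loose_blob(big)))
```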

In a pack file, the object gets further compressed by comparing it to other objects in the repository.¹ For text files it's simple to explain how this works (although the delta compression algorithm is different): if you had a long file and removed line 75, you could just say "use that other copy we have over there, but remove line 75." If you added a new line, you could say "use that other copy, but add this new line." You can express large files as sequences of instructions, using other large files as the basis.
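
The "use that other copy, but..." idea can be sketched as a list of copy and insert instructions. This is a line-based toy built on `difflib`, with made-up function names; as noted above, the real delta compression algorithm is different, so treat this only as an illustration of expressing one file in terms of another.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Describe target as copy/insert instructions against base."""
    delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))           # reuse base[i1:i2]
        elif j2 > j1:
            delta.append(("insert", target[j1:j2]))  # brand-new lines
        # Removed lines need no instruction: they are simply never copied.
    return delta


def apply_delta(base, delta):
    """Rebuild the target from the base plus the instructions."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


base = [f"line {n}" for n in range(1, 101)]
target = base[:74] + base[75:]      # remove line 75
target.append("a new last line")    # and add one new line

delta = make_delta(base, target)
print(delta)
print(apply_delta(base, delta) == target)
```

The delta holds two copy ranges and one inserted line, which is far smaller than a second full copy of the hundred lines.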

Git does this sort of compression for all objects (not just files), so it can compress a commit against another commit, or trees against each other, too. It's really quite efficient, but with one problem.

Some (not all) binary files delta-compress very badly against each other. In particular, with a file that is compressed with something like bzip2, gzip, or zip, making one small change anywhere tends to change the rest of the file as well. Images (JPEGs, etc.) are often compressed and suffer from this sort of effect. (I don't know of many uncompressed image formats. PBM files are completely uncompressed, but that's the only one I know of off-hand that is still in use.)

If you make no changes at all to binary files, git compresses them super-efficiently because of the unchanging low-level object-IDs. If you make small changes, git's compression algorithms can (not necessarily "will") fail on them, so that you get multiple copies of the binaries. I know that large gzip'ed cpio and tar archives do very badly: one small change to such a file and a 2 GB repo becomes a 4 GB repo.
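
A rough way to see the effect in Python: change one byte of some compressible data, then compare how much the raw bytes and the zlib-compressed bytes have in common. The position-by-position measure used here is a crude stand-in chosen for this sketch (a real delta can also cope with shifted data), and the sample data is invented.

```python
import zlib


def positional_similarity(a: bytes, b: bytes) -> float:
    """Fraction of byte positions holding the same value in both inputs."""
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))


original = "\n".join(
    f"record {n}: some payload text" for n in range(5000)
).encode("ascii")
edited = b"R" + original[1:]  # flip the case of the very first letter

raw_similarity = positional_similarity(original, edited)
packed_similarity = positional_similarity(
    zlib.compress(original), zlib.compress(edited)
)
print(f"raw: {raw_similarity:.4f}  compressed: {packed_similarity:.4f}")
```

The raw versions differ in a single position, while the compressed versions share less, so a delta between the compressed forms has less to reuse. How much less depends on the data and the compressor, which is why experimenting with your own files is worthwhile.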

Whether your particular binaries compress well or not is something you'd have to experiment with. If you're just renaming the files, you should have no problem at all. If you're changing large JPG images often, I would expect this to perform poorly (but it's worth experimenting).

¹ In "normal" pack files, an object can only be delta-compressed against other objects in the same pack file. This keeps the pack files stand-alone, as it were. A "thin" pack can use objects not in the pack file itself; these are meant for incremental updates over networks, for instance, as with git fetch.
