If git functions off of snapshots of files, why doesn't .git/ become huge over time?


Problem Description


I have been reading the git book. In this book I learned that git functions through taking snapshots of the files you work with, instead of deltas like other VCSs. This has some excellent benefits.

However, this leaves me wondering: over time, shouldn't the .git/ folder containing these snapshots blow up to be too large? There are repositories with 10,000 or more commits, containing hundreds of files. Why doesn't git blow up in size?

Solution

The trick here is that this claim:

git functions through taking snapshots of the files you work with, instead of deltas like other VCSs

is both true and false!

Git's main object database—a key-value store—stores four object types. We don't need to go into all the details here; we can just note that files—or more precisely, files' contents—are stored in blob objects. Commit objects then refer (indirectly) to the blob objects, so if you have some file content named bigfile.txt and store it in 1000 different commits, there's only one object in all of those commits, re-used 1000 times. (In fact, if you rename it to hugefile.txt without changing its content, new commits continue to re-use the original object—the name is stored separately, in tree objects.)
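
A quick way to see this content-addressing in action is with Git's plumbing commands. A minimal sketch, where the demo repository and the file names bigfile.txt/hugefile.txt are made up for the illustration:

```sh
# Hypothetical demo repository and file names, just for illustration.
git init demo && cd demo
echo "some file content" > bigfile.txt
git add bigfile.txt
git commit -m "add bigfile.txt"

# The blob's key is a hash of the content only:
git hash-object bigfile.txt

# Renaming does not create a new blob: the tree simply maps the new
# name to the same blob hash.
git mv bigfile.txt hugefile.txt
git commit -m "rename to hugefile.txt"
git ls-tree HEAD        # same blob hash, now listed under hugefile.txt
```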

That's all fine, but over time, most files in most projects do accumulate changes. Other VCSes will, instead of storing a whole new copy of each file, make use of delta encoding to avoid storing every version of every file separately. If a blob object is a complete, intact (albeit zlib-deflated) file, your question boils down to this: wouldn't the accumulation of separate blob objects make the object database grow much faster than a VCS that uses delta compression?

The answer is that it would, but Git does use delta compression. It just does it below the level of the object database. Objects are logically independent. You give Git the key—the hash ID—for some object, and you get the entire object back. But only so-called loose objects are stored as a simple zlib-deflated file.
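
As a rough sketch (assuming a POSIX shell and python3 are available, and re-using the hypothetical hugefile.txt from above), you can check that a loose object is a single zlib-deflated file, and that Git hands back the whole object when given its key:

```sh
# Write the file's content as a loose blob and capture its key.
blob=$(git hash-object -w hugefile.txt)

git cat-file -t "$blob"     # reports the object type: "blob"
git cat-file -p "$blob"     # prints the complete content back

# On disk the loose object is one zlib-deflated file under
# .git/objects/<first 2 hex digits>/<remaining digits>.
ls .git/objects/"${blob:0:2}"/"${blob:2}"
python3 -c 'import sys, zlib; print(zlib.decompress(open(sys.argv[1], "rb").read())[:40])' \
  .git/objects/"${blob:0:2}"/"${blob:2}"
# -> b'blob 18\x00some file content\n'  (a type/size header, then the raw data)
```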

As Jonathan Brink noted, git gc cleans up unused objects. This does not help with retained objects, such as older versions of hugefile.txt or whatever. But git gc—which Git runs automatically whenever Git thinks it might be appropriate—does more than just prune unreferenced objects. It also runs git repack, which builds or re-builds pack files.
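
A sketch of what that looks like in practice (the exact object counts will of course vary from repository to repository):

```sh
git count-objects -v    # "count" = loose objects, "in-pack" = objects in packs
git gc                  # prunes unreferenced objects, then runs git repack
git count-objects -v    # the loose count drops; most objects are now packed
ls .git/objects/pack/   # the resulting *.pack file(s) plus their *.idx indexes
```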

A pack file stores multiple objects, and inside a pack file, objects are delta-compressed. Git pores over the collection of all objects that will go into a single pack file, and for all N objects, picks some set B of them to use as delta bases. These objects are merely zlib-deflated. The remaining N-B objects are encoded as deltas, against either the bases, or against earlier delta-encoded objects that use those bases. Hence, given a key for an object stored in a pack file, Git can find the stored object or delta, and if what is stored is a delta, Git can also find the underlying objects, all the way down to the delta bases, and hence extract the complete object.
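
You can inspect this delta structure directly with git verify-pack. A rough sketch, assuming the repository already has at least one pack file (the pack name is simply whatever happens to exist on disk):

```sh
# Pick an arbitrary pack index from the repository.
pack=$(ls .git/objects/pack/pack-*.idx | head -n 1)

git verify-pack -v "$pack" | head -n 20
# Non-delta entries show: hash, type, size, size-in-pack, offset.
# Deltified entries add two columns: the delta-chain depth and the hash of
# the base object they are encoded against. Entries with no depth/base are
# the delta bases B; the rest are the N-B delta-encoded objects.
```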

Hence, Git does use delta encoding, but only within a pack file. It's also based not on files but rather on objects, so (at least in theory) if you have huge trees, or long texts inside commit messages, those can be compressed against each other as well.

Even this is not quite the whole story though: for transmission over networks, Git will build so-called thin packs. The key difference between a regular pack and a thin pack has to do with those delta bases. Given a regular pack file and a hash ID, Git can always retrieve the complete object from that file alone. With a thin pack, however, Git is allowed to use objects that are not in that pack file (as long as the other Git, to which the thin-pack is being transported, has claimed that it has those objects). The receiver is required to "fix" the thin pack on receipt, but this allows git fetch and git push to send deltas rather than complete snapshots.
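
The plumbing commands involved can sketch this round trip by hand; git fetch and git push normally drive it over the wire protocol, and main/main~10 below are just an example revision range:

```sh
# Sender side: build a thin pack for the objects reachable from main but
# not from main~10, allowing deltas against objects outside the pack.
printf 'main\n^main~10\n' |
  git pack-objects --revs --thin --stdout > thin.pack

# Receiver side: "fix" the thin pack by appending the missing delta bases
# from the local object store, so the stored pack is self-contained again.
git index-pack --fix-thin --stdin < thin.pack
```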
