Does git de-duplicate between files?

Question

If my repository contains several copies of the same files with only small changes (don't ask why), will git save space by only storing the differences between the files?

Recommended answer

It could, but it is very hard to say whether it will. There are situations where it is guaranteed that it won't.

To understand this answer (and its limitations) we must look at the way git stores objects. There's a good description of the format of "git objects" (as stored in .git/objects/) in this Stack Overflow answer or in the Pro Git book.

When storing "loose objects" like this—which git does for what we might call "active" objects—they are zlib-deflated, as the Pro Git book says, but not otherwise compressed. So two different (not bit-for-bit identical) files stored in two different objects are never compressed against each other.
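
The loose-object behaviour described above can be sketched in a few lines of Python. This is only an illustration, assuming the default SHA-1 object format and the blob layout described in the Pro Git book (a `blob <size>\0` header followed by the content); the helper names `loose_object` and `pseudo_random` are made up for this example and are not part of git.

```python
import hashlib
import zlib


def loose_object(content: bytes) -> tuple:
    """Return (object id, deflated bytes) for a blob, roughly as a loose object holds it.

    Assumption: SHA-1 object format, header of the form b"blob <size>\\0".
    """
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def pseudo_random(n_blocks: int) -> bytes:
    """Deterministic, poorly compressible filler standing in for a large file."""
    return b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(n_blocks))


file_a = pseudo_random(2048)               # 64 KiB "big file A"
file_b = file_a + b"one small change\n"    # nearly identical "big file B"

id_a, deflated_a = loose_object(file_a)
id_b, deflated_b = loose_object(file_b)
id_a_copy, _ = loose_object(bytes(file_a))  # a bit-for-bit identical copy of A

# Identical content maps to the same object id, so it is stored only once.
print("identical copies share an id:", id_a == id_a_copy)
# A tiny change gives a different id and a second, independently deflated object.
print("near-copy gets its own id:   ", id_a != id_b)
print("bytes stored as loose objects:", len(deflated_a) + len(deflated_b))
```

Each object is deflated on its own, so the near-copy costs about as much as the original; only bit-for-bit identical files share storage at this stage.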

On the other hand, eventually objects can be "packed" into a "pack file". See another section of the Pro Git book for information on pack files. Objects stored in pack files are "delta-compressed" against other objects in the same file. Precisely what criteria git uses for choosing which objects are compressed against which other objects is quite obscure. Here's a snippet from the Pro Git Book again:

When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space. The git verify-pack plumbing command allows you to see what was packed up [...]

If git decides to delta-compress "pack entry for big file A" vs "pack entry for big file B", then—and only then—can git save space in the way you asked.
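
To get a feel for why compressing one entry against another matters, here is a rough Python sketch. It does not use git's actual delta encoding; as a stand-in it uses zlib's preset-dictionary feature (`zdict`) so that "file B" can refer back to the bytes of "file A". The data, sizes, and helper names are all assumptions made for the illustration.

```python
import hashlib
import zlib


def pseudo_random(n_blocks: int) -> bytes:
    """Deterministic, poorly compressible filler standing in for file content."""
    return b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(n_blocks))


def deflate(data: bytes, base: bytes = None) -> bytes:
    """Deflate data, optionally letting it reference `base` as a preset dictionary."""
    comp = zlib.compressobj() if base is None else zlib.compressobj(zdict=base)
    return comp.compress(data) + comp.flush()


def inflate(data: bytes, base: bytes) -> bytes:
    """Reverse deflate(data, base); the same base must be supplied again."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(data) + decomp.flush()


file_a = pseudo_random(512)                               # 16 KiB, fits zlib's 32 KiB window
file_b = file_a[:8000] + b"a small edit" + file_a[8000:]  # close but not identical

alone = deflate(file_b)                # like a loose object: no knowledge of file A
against_a = deflate(file_b, file_a)    # stand-in for storing B relative to A

print("B compressed on its own:  ", len(alone), "bytes")
print("B compressed against A:   ", len(against_a), "bytes")
print("round trip ok:", inflate(against_a, file_a) == file_b)
```

The second figure is far smaller, but note the cost: file B can no longer be read back without file A, which is the same kind of dependency a deltified pack entry has on its base.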

Git makes pack files every time git gc runs (or more precisely, through git pack-objects and git repack; higher level operations, including git gc, run these for you when needed/appropriate). At this time, git gathers up loose objects, and/or explodes and re-packs existing packs. If your close-but-not-quite-identical files get delta-compressed against each other at this point, you may see some very large space-savings.

If you then go to modify the files, though, you'll work on the expanded and uncompressed versions in your work tree and then git add the result. This will make a new "loose object", and by definition that won't be delta-compressed against anything (no other loose object, nor any pack).

When you clone a repository, generally git makes packs (or even "thin packs", which are packs that are not stand-alone) out of the objects to be transferred, so that what is sent across the Intertubes is as small as possible. So here you may get the benefit of delta compression even if the objects are loose in the source repository. Again, you'll lose the benefit as soon as you start working on those files (turning them into loose objects), and regain it only if-and-when the loose objects are packed again and git's heuristics compress them against each other.

The real takeaway here is that to find out, you can simply try it, using the method outlined in the Pro Git book.
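
One way to try it is sketched below in Python: after running git gc, feed the pack index to git verify-pack -v and see which entries were stored as deltas. This assumes the documented output layout of that command (five columns for an undeltified object; seven for a deltified one, the extra two being the delta depth and the base object id). The names `PackEntry`, `parse_verify_pack`, and `verify_pack` are hypothetical helpers, not part of git.

```python
import subprocess
from typing import List, NamedTuple, Optional

HEX = set("0123456789abcdef")


class PackEntry(NamedTuple):
    oid: str
    kind: str
    size: int              # size of the object (or of the delta, if deltified)
    size_in_pack: int
    base: Optional[str]    # base object id; None when the entry is not a delta


def parse_verify_pack(output: str) -> List[PackEntry]:
    """Parse `git verify-pack -v` text, skipping its summary lines."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        # Object lines: 5 fields (undeltified) or 7 (deltified: + depth, base id).
        if len(fields) not in (5, 7) or not set(fields[0]) <= HEX:
            continue
        base = fields[6] if len(fields) == 7 else None
        entries.append(PackEntry(fields[0], fields[1], int(fields[2]), int(fields[3]), base))
    return entries


def verify_pack(idx_path: str) -> List[PackEntry]:
    """Run git on a pack index file (requires git on PATH) and parse the result."""
    result = subprocess.run(
        ["git", "verify-pack", "-v", idx_path],
        check=True, capture_output=True, text=True,
    )
    return parse_verify_pack(result.stdout)
```

Calling `verify_pack` on an .idx file under .git/objects/pack/ and listing the entries whose `base` is not `None` shows which of your near-duplicate files, if any, git chose to store relative to another.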
