Are Git's pack files deltas rather than snapshots?


Question


One of the key differences between Git and most other version control systems is that the others tend to store commits as a series of deltas - changesets between one commit and the next. This seems logical, since it's the smallest possible amount of information to store about a commit. But the longer the commit history gets, the more calculation it takes to compare ranges of revisions.

By contrast, Git stores a complete snapshot of the whole project in each revision. The reason this doesn't make the repo size grow dramatically with each commit is each file in the project is stored as a file in the Git subdirectory, named for the hash of its contents. So if the contents haven't changed, the hash hasn't changed, and the commit just points to the same file. And there are other optimizations as well.

All this made sense to me until I stumbled on this information about pack files, into which Git puts data periodically to save space:

"In order to save that space, Git utilizes the packfile. This is a format where Git will only save the part that has changed in the second file, with a pointer to the file it is similar to."

Isn't this basically going back to storing deltas? If not, how is it different? How does this avoid subjecting Git to the same problems other version control systems have?

For example, Subversion uses deltas, and rolling back 50 versions means undoing 50 diffs, whereas with Git you can just grab the appropriate snapshot. Unless Git also stores 50 diffs in the packfiles... is there some mechanism that says "after some small number of deltas, we'll store a whole new snapshot" so that we don't pile up too large a changeset? How else might Git avoid the disadvantages of deltas?

Solution

Summary:
Git’s pack files are carefully constructed to effectively use disk caches and provide "nice" access patterns for common commands and for reading recently referenced objects.


Git’s pack file format is quite flexible (see Documentation/technical/pack-format.txt, or The Packfile in The Git Community Book). The pack files store objects in two main ways: "undeltified" (take the raw object data and deflate-compress it), or "deltified" (form a delta against some other object, then deflate-compress the resulting delta data). The objects stored in a pack can be in any order; they do not have to be sorted by object type, object name, or any other attribute. A deltified object can be made against any other suitable object of the same type.
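As a rough illustration of the two storage modes, here is a toy Python sketch. It is not Git's on-disk delta encoding: make_delta, apply_delta and the copy/insert tuples are names and a format made up for this example, and difflib stands in for Git's own delta algorithm. It only shows why a delta against a similar object can deflate to far fewer bytes than the object itself.

```python
import difflib
import zlib


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions (toy format)."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes already in the base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # new bytes are stored literally
        # a pure deletion needs no instruction: those base bytes are never copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from its base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def undeltified_size(data: bytes) -> int:
    """Bytes needed when the raw object is deflate-compressed."""
    return len(zlib.compress(data))


def deltified_size(base: bytes, target: bytes) -> int:
    """Bytes needed when only the (toy-serialized) delta is deflate-compressed."""
    return len(zlib.compress(repr(make_delta(base, target)).encode()))


if __name__ == "__main__":
    old = b"".join(b"value_%d = %d\n" % (i, i * i) for i in range(100))
    new = old + b"value_100 = 10000\n"
    assert apply_delta(old, make_delta(old, new)) == new
    print("undeltified:", undeltified_size(new), "bytes")
    print("deltified:  ", deltified_size(old, new), "bytes")
```

Either way the reader gets the complete object back; the delta is purely a storage detail inside the pack, which is why Git can still be described as storing snapshots.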

Git’s pack-objects command uses several heuristics to provide excellent locality of reference for common commands. These heuristics control both the selection of base objects for deltified objects and the order of the objects. Each mechanism is mostly independent, but they share some goals.

Git does form long chains of delta-compressed objects, but the heuristics try to make sure that only "old" objects are at the ends of the long chains. The delta base cache (whose size is controlled by the core.deltaBaseCacheLimit configuration variable) is automatically used and can greatly reduce the number of "rebuilds" required for commands that need to read a large number of objects (e.g. git log -p).
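To see why a base cache matters for chains, here is a small Python sketch under simplifying assumptions: ToyPack is a hypothetical class, the "delta" is just "keep a prefix of the base and append some bytes", and the cache limit counts entries, whereas core.deltaBaseCacheLimit is a size in bytes. The newest version is stored whole and each older version is a delta against the next newer one.

```python
from collections import OrderedDict


class ToyPack:
    """Toy object store with delta chains and an LRU cache of rebuilt objects."""

    def __init__(self, cache_entries=0):
        self.entries = {}                 # name -> ("full", data) or ("delta", base, keep, extra)
        self.cache = OrderedDict()        # rebuilt objects, least recently used first
        self.cache_entries = cache_entries
        self.delta_applications = 0       # how many times a delta had to be applied

    def add_full(self, name, data):
        self.entries[name] = ("full", data)

    def add_delta(self, name, base, keep, extra=b""):
        # Toy delta: the first `keep` bytes of the base, followed by `extra`.
        self.entries[name] = ("delta", base, keep, extra)

    def read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        entry = self.entries[name]
        if entry[0] == "full":
            data = entry[1]
        else:
            _, base, keep, extra = entry
            data = self.read(base)[:keep] + extra   # walk down the chain
            self.delta_applications += 1
        if self.cache_entries:
            self.cache[name] = data
            if len(self.cache) > self.cache_entries:
                self.cache.popitem(last=False)      # evict the least recently used
        return data


def build_chain(pack, versions=10):
    """Store the newest version whole; every older one as a delta on the next newer."""
    contents = [b"".join(b"line %d\n" % i for i in range(k + 1)) for k in range(versions)]
    pack.add_full("v%d" % (versions - 1), contents[-1])
    for k in range(versions - 2, -1, -1):
        pack.add_delta("v%d" % k, "v%d" % (k + 1), keep=len(contents[k]))
    return contents


if __name__ == "__main__":
    for limit in (0, 10):
        pack = ToyPack(cache_entries=limit)
        build_chain(pack)
        for k in range(9, -1, -1):        # newest first, like walking history
            pack.read("v%d" % k)
        print("cache entries:", limit, "delta applications:", pack.delta_applications)
```

With the cache disabled, every read replays its whole chain; with it enabled, each delta is applied once, because the base it needs was rebuilt by the previous read.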

Delta Compression Heuristic

A typical Git repository stores a very large number of objects, so it cannot reasonably compare them all to find the pairs (and chains) that will yield the smallest delta representations.

The delta base selection heuristic is based on the idea that good delta bases will be found among objects with similar filenames and sizes. Each type of object is processed separately (i.e. an object of one type will never be used as the delta base for an object of another type).

For the purposes of delta base selection, the objects are sorted (primarily) by filename and then size. A window into this sorted list is used to limit the number of objects that are considered as potential delta bases. If a "good enough" [1] delta representation is not found for an object among the objects in its window, then the object will not be delta compressed.

The size of the window is controlled by the --window= option of git pack-objects, or the pack.window configuration variable. The maximum depth of a delta chain is controlled by the --depth= option of git pack-objects, or the pack.depth configuration variable. The --aggressive option of git gc greatly enlarges both the window size and the maximum depth to attempt to create a smaller pack file.

The filename sort clumps together the objects for entries with identical names, or at least similar endings (e.g. .c). The size sort is from largest to smallest so that deltas that remove data are preferred to deltas that add data (since removal deltas have shorter representations) and so that the earlier, larger objects (usually newer) tend to be represented with plain compression.

[1] What qualifies as "good enough" depends on the size of the object in question and its potential delta base as well as how deep its resulting delta chain would be.
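The window idea can be sketched in Python as follows. This is a simplified model, not the code in git pack-objects: Obj, toy_delta_size and choose_bases are hypothetical names, the reversed path is only a stand-in for Git's name-based sort key, the delta cost is estimated from the shared prefix and suffix, and "good enough" is assumed to mean "less than half the object's size".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Obj:
    name: str                      # path the object was seen at
    data: bytes
    base: Optional["Obj"] = None   # chosen delta base, if any
    depth: int = 0                 # length of the delta chain below this object


def toy_delta_size(base: bytes, target: bytes) -> int:
    """Estimate a delta as the bytes of target outside the shared prefix and suffix."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    return len(target) - prefix - suffix + 8   # 8 = assumed fixed overhead


def choose_bases(objects, window=10, max_depth=50):
    """Pick a delta base for each object from a sliding window over a sorted list."""
    # Similar name endings sort together; within a name, larger objects come first.
    ordered = sorted(objects, key=lambda o: (o.name[::-1], -len(o.data)))
    for i, target in enumerate(ordered):
        target.base, target.depth = None, 0
        start = max(0, i - window)
        candidates = [
            (toy_delta_size(c.data, target.data), c.depth, j)
            for j, c in enumerate(ordered[start:i], start)
            if c.depth < max_depth                 # respect the chain depth limit
        ]
        if candidates:
            size, depth, j = min(candidates)       # smallest delta, then shallowest base
            if size < len(target.data) // 2:       # assumed "good enough" threshold
                target.base, target.depth = ordered[j], depth + 1
    return ordered


if __name__ == "__main__":
    line = b"int main(void) { return 0; }\n"
    objs = [Obj("src/main.c", line * n) for n in (4, 8, 12)]
    objs.append(Obj("README", b"Toy project readme text.\n" * 6))
    for o in choose_bases(objs):
        kind = "delta, depth %d" % o.depth if o.base else "plain"
        print(o.name, len(o.data), kind)
```

Because of the largest-first size sort, the biggest version of src/main.c has nothing suitable ahead of it in the window and is stored plain, while the smaller versions become deltas against it.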

Object Ordering Heuristic

Objects are stored in the pack files in a "most recently referenced" order. The objects needed to reconstruct the most recent history are placed earlier in the pack and they will be close together. This usually works well for OS disk caches.

All the commit objects are sorted by commit date (most recent first) and stored together. This placement and ordering optimizes the disk accesses needed to walk the history graph and extract basic commit information (e.g. git log).

The tree and blob objects are stored starting with the tree from the first stored (most recent) commit. Each tree is processed in a depth-first fashion, storing any objects that have not already been stored. This puts all the trees and blobs required to reconstruct the most recent commit together in one place. Any trees and blobs that have not yet been saved but that are required for later commits are stored next, in the sorted commit order.

The final object ordering is slightly affected by the delta base selection in that if an object is selected for delta representation and its base object has not been stored yet, then its base object is stored immediately before the deltified object itself. This prevents likely disk cache misses due to the non-linear access required to read a base object that would have "naturally" been stored later in the pack file.
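The ordering rules above can be sketched in Python on a toy object graph. The function pack_order and the dictionary layout are assumptions made for this example, and the sketch leaves out the adjustment that moves a delta base in front of the object that depends on it.

```python
def pack_order(commits, trees):
    """Return object ids in a "most recently referenced" order.

    commits: {commit_id: (commit_date, root_tree_id)}
    trees:   {tree_id: [child ids]}; any id that is not a key here is a blob
    """
    # All commits go first, newest first, so history walks read one contiguous run.
    ordered_commits = sorted(commits, key=lambda c: commits[c][0], reverse=True)
    order = list(ordered_commits)
    seen = set(order)

    def walk(obj_id):
        if obj_id in seen:          # already stored for a more recent commit
            return
        seen.add(obj_id)
        order.append(obj_id)
        for child in trees.get(obj_id, []):
            walk(child)             # depth-first into subtrees

    # Then the trees and blobs of each commit, again newest commit first.
    for commit_id in ordered_commits:
        walk(commits[commit_id][1])
    return order


if __name__ == "__main__":
    commits = {"c1": (1, "t1"), "c2": (2, "t2")}
    trees = {
        "t1": ["readme", "src1"],
        "src1": ["main_v1"],
        "t2": ["readme", "src2"],
        "src2": ["main_v2"],
    }
    print(pack_order(commits, trees))
```

In this toy history everything needed to check out the newest commit (t2, readme, src2, main_v2) sits together right after the commits, and only the objects unique to the older commit follow.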
