Are Git's pack files deltas rather than snapshots?

Question

One of the key differences between Git and most other version control systems is that the others tend to store commits as a series of deltas - changesets between one commit and the next. This seems logical, since it's the smallest possible amount of information to store about a commit. But the longer the commit history gets, the more calculation it takes to compare ranges of revisions.

By contrast, Git stores a complete snapshot of the whole project in each revision. The reason this doesn't make the repo size grow dramatically with each commit is each file in the project is stored as a file in the Git subdirectory, named for the hash of its contents. So if the contents haven't changed, the hash hasn't changed, and the commit just points to the same file. And there are other optimizations as well.
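
To make the "named for the hash of its contents" point concrete, here is a minimal Python sketch. It assumes a repository using SHA-1 object names, where a blob's name is the SHA-1 of a short header (`blob <size>` plus a NUL byte) followed by the content. The `v1`/`v2` dictionaries are a hypothetical stand-in for two snapshots, not Git's actual tree format.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the object name Git gives a blob with this content (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Toy "snapshots" (illustrative only): each maps a path to the blob ID of its content.
v1 = {
    "README": git_blob_id(b"hello\n"),
    "main.c": git_blob_id(b"int main(void) { return 0; }\n"),
}
v2 = {
    "README": git_blob_id(b"hello\n"),  # unchanged, so same ID as in v1
    "main.c": git_blob_id(b"int main(void) { return 1; }\n"),
}

# Two full snapshots of two files, but only three distinct blobs need storing.
unique_blobs = set(v1.values()) | set(v2.values())
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
print(len(unique_blobs))        # 3
```

The unchanged README contributes one blob no matter how many snapshots reference it, which is why full snapshots do not multiply the repository size.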

All this made sense to me until I stumbled on this information about pack files, into which Git puts data periodically to save space:

In order to save that space, Git utilizes the packfile. This is a format where Git will only save the part that has changed in the second file, with a pointer to the file it is similar to.

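Isn't this basically going back to storing deltas? If not, how is it different? How does this avoid subjecting Git to the same problems that other version control systems have?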

For example, Subversion uses deltas, and rolling back 50 versions means undoing 50 diffs, whereas with Git you can just grab the appropriate snapshot. Unless Git also stores 50 diffs in the packfiles... Is there some mechanism that says "after some small number of deltas, we'll store a whole new snapshot" so that we don't pile up too long a chain of changesets? How else might Git avoid the disadvantages of deltas?

Recommended answer

Summary:
Git’s pack files are carefully constructed to effectively use disk caches and provide "nice" access patterns for common commands and for reading recently referenced objects.

Git’s pack file format is quite flexible (see Documentation/technical/pack-format.txt, or The Packfile in The Git Community Book). The pack files store objects in two main ways: "undeltified" (take the raw object data and deflate-compress it), or "deltified" (form a delta against some other object then deflate-compress the resulting delta data). The objects stored in a pack can be in any order (they do not (necessarily) have to be sorted by object type, object name, or any other attribute) and deltified objects can be made against any other suitable object of the same type.
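
As a rough illustration of the two storage forms, the Python sketch below deflate-compresses an object whole ("undeltified") and, alternatively, deflate-compresses a delta against a similar base ("deltified"). The copy/insert instruction list and its text encoding are a toy invented for this example. Git's real delta encoding is a compact binary format described in pack-format.txt, and Git does not use `difflib` to find matches.

```python
import difflib
import zlib

def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as 'copy a range of base' and 'insert literal bytes' steps."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no step: those base bytes are simply never copied.
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

def encode_delta(ops: list) -> bytes:
    """Toy serialization (hypothetical, not Git's format) so the delta can be compressed."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += b"C %d %d\n" % (op[1], op[2])
        else:
            out += b"I %d\n" % len(op[1]) + op[1]
    return bytes(out)

base = b"".join(b"line %d of the file\n" % i for i in range(50))
target = base.replace(b"line 25 of", b"LINE 25 OF")

ops = make_delta(base, target)
undeltified_size = len(zlib.compress(target))           # whole object, deflated
deltified_size = len(zlib.compress(encode_delta(ops)))  # delta against base, deflated

print(undeltified_size, deltified_size)
```

Reading the deltified form requires reading the base first and then applying the delta, which is the cost the rest of this answer is about keeping small.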

Git’s pack-objects command uses several heuristics to provide excellent locality of reference for common commands. These heuristics control both the selection of base objects for deltified objects and the order of the objects. Each mechanism is mostly independent, but they share some goals.

Git does form long chains of delta compressed objects, but the heuristics try to make sure that only "old" objects are at the ends of the long chains. The delta base cache (whose size is controlled by the core.deltaBaseCacheLimit configuration variable) is automatically used and can greatly reduce the number of "rebuilds" required for commands that need to read a large number of objects (e.g. git log -p).
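
The effect of a delta base cache can be sketched with a toy model in Python. Everything here is an assumption made for illustration: the `ToyPack` class, the "keep the first N bytes of the base" delta, and a cache limited by entry count (the real `core.deltaBaseCacheLimit` is a size in bytes). The newest version is stored whole and each older version is a delta against the next newer one, so the oldest object sits at the end of the longest chain.

```python
from collections import OrderedDict

class ToyPack:
    """Hypothetical pack: objects are stored whole or as a delta against a base."""

    def __init__(self, cache_entries: int):
        self.stored = {}            # name -> ("full", data) or ("delta", (base, keep_len))
        self.cache = OrderedDict()  # rebuilt objects, least recently used first
        self.cache_entries = cache_entries
        self.delta_applications = 0

    def read(self, name: str) -> bytes:
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        kind, payload = self.stored[name]
        if kind == "full":
            data = payload
        else:
            base_name, keep_len = payload
            data = self.read(base_name)[:keep_len]  # toy delta: truncate the base
            self.delta_applications += 1
        self.cache[name] = data
        if len(self.cache) > self.cache_entries:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return data

def build_pack(cache_entries: int, versions: int = 10):
    pack = ToyPack(cache_entries)
    contents = [b"".join(b"entry %d\n" % k for k in range(i + 1)) for i in range(versions)]
    newest = versions - 1
    pack.stored["v%d" % newest] = ("full", contents[newest])
    for i in range(newest):
        # Each older version is a delta against the next newer one.
        pack.stored["v%d" % i] = ("delta", ("v%d" % (i + 1), len(contents[i])))
    return pack, contents

def rebuilds_for_history_walk(cache_entries: int) -> int:
    """Read every version, newest first, and count delta applications."""
    pack, contents = build_pack(cache_entries)
    for i in reversed(range(len(contents))):
        assert pack.read("v%d" % i) == contents[i]
    return pack.delta_applications

print(rebuilds_for_history_walk(0), rebuilds_for_history_walk(2))  # 45 9
```

Without a cache, each read walks its whole chain again. With even a small cache, a newest-to-oldest walk applies each delta once, because the base it needs was rebuilt by the previous read.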

A typical Git repository stores a very large number of objects, so it cannot reasonably compare them all to find the pairs (and chains) that will yield the smallest delta representations.

The delta base selection heuristic is based on the idea that the good delta bases will be found among objects with similar filenames and sizes. Each type of object is processed separately (i.e. an object of one type will never be used as the delta base for an object of another type).

For the purposes of delta base selection, the objects are sorted (primarily) by filename and then size. A window into this sorted list is used to limit the number of objects that are considered as potential delta bases. If a "good enough" [1] delta representation is not found for an object among the objects in its window, then the object will not be delta compressed.

The size of the window is controlled by the --window= option of git pack-objects, or the pack.window configuration variable. The maximum depth of a delta chain is controlled by the --depth= option of git pack-objects, or the pack.depth configuration variable. The --aggressive option of git gc greatly enlarges both the window size and the maximum depth to attempt to create a smaller pack file.

The filename sort clumps together the objects for entries with identical names (or at least similar endings (e.g. .c)). The size sort is from largest to smallest so that deltas that remove data are preferred to deltas that add data (since removal deltas have shorter representations) and so that the earlier, larger objects (usually newer) tend to be represented with plain compression.

[1] What qualifies as "good enough" depends on the size of the object in question and its potential delta base as well as how deep its resulting delta chain would be.
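
The windowed search described above can be sketched roughly as follows in Python. This is a toy under stated assumptions, not the code in pack-objects: the `Obj` type, the `delta_cost` estimate (target bytes not covered by matches found with `difflib`), and the "cost must be under half the object's size" threshold are all made up for illustration. The `window=10` and `max_depth=50` defaults mirror the documented defaults of `pack.window` and `pack.depth`.

```python
import difflib
from typing import NamedTuple

class Obj(NamedTuple):
    name: str    # path the object was found at
    data: bytes  # raw object content

def delta_cost(base: bytes, target: bytes) -> int:
    """Toy estimate: number of target bytes that cannot be copied from base."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(target) - matched

def choose_bases(objects, window=10, max_depth=50):
    """Return (sorted objects, index of chosen base or None for each object)."""
    # Sort by filename, then by size with the largest first.
    ordered = sorted(objects, key=lambda o: (o.name, -len(o.data)))
    base_index = [None] * len(ordered)
    depth = [0] * len(ordered)
    for i, obj in enumerate(ordered):
        best_cost = len(obj.data) // 2  # hypothetical "good enough" threshold
        # Only the previous `window` objects in sorted order are candidates.
        for j in range(max(0, i - window), i):
            if depth[j] + 1 > max_depth:
                continue  # using this base would make the chain too deep
            cost = delta_cost(ordered[j].data, obj.data)
            if cost < best_cost:
                best_cost, base_index[i] = cost, j
        if base_index[i] is not None:
            depth[i] = depth[base_index[i]] + 1
    return ordered, base_index

objs = [
    Obj("a.c", b"int a;\nint b;\n"),
    Obj("notes.txt", b"completely different text\n"),
    Obj("a.c", b"int a;\nint b;\nint c;\n"),
]
ordered, base_index = choose_bases(objs)
for obj, base in zip(ordered, base_index):
    print(obj.name, len(obj.data), "base:", base)
```

In this run the larger `a.c` is stored whole, the smaller `a.c` becomes a removal delta against it, and `notes.txt` finds nothing good enough in its window and stays undeltified.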

Objects are stored in the pack files in a "most recently referenced" order. The objects needed to reconstruct the most recent history are placed earlier in the pack and they will be close together. This usually works well for OS disk caches.

All the commit objects are sorted by commit date (most recent first) and stored together. This placement and ordering optimizes the disk accesses needed to walk the history graph and extract basic commit information (e.g. git log).

The tree and blob objects are stored starting with the tree from the first stored (most recent) commit. Each tree is processed in a depth first fashion, storing any objects that have not already been stored. This puts all the trees and blobs required to reconstruct the most recent commit together in one place. Any trees and blobs that have not yet been saved but that are required for later commits are stored next, in the sorted commit order.
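
The ordering just described can be sketched with a small Python model. The dictionaries and object names below are hypothetical stand-ins for a real object graph, and the pre-order walk is one plausible reading of "depth first". The adjustment for delta bases that have not been stored yet is not modeled.

```python
def pack_order(commits: dict, trees: dict) -> list:
    """Return object names in a toy 'most recently referenced' pack order.

    commits: commit name -> {"date": int, "tree": root tree name}
    trees:   tree name -> list of child names (subtrees or blobs)
    """
    # All commits go first, most recent first.
    by_date = sorted(commits, key=lambda c: commits[c]["date"], reverse=True)
    order = list(by_date)
    seen = set()

    def walk(obj_id: str) -> None:
        if obj_id in seen:
            return  # already stored for a more recent commit
        seen.add(obj_id)
        order.append(obj_id)
        for child in trees.get(obj_id, []):  # blobs have no children
            walk(child)

    # Then trees and blobs, starting from the most recent commit's root tree.
    for commit in by_date:
        walk(commits[commit]["tree"])
    return order

trees = {
    "tree2": ["blobA2", "sub1"],
    "tree1": ["blobA1", "sub1"],
    "sub1": ["blobB"],
}
commits = {
    "c1": {"date": 100, "tree": "tree1"},
    "c2": {"date": 200, "tree": "tree2"},
}

order = pack_order(commits, trees)
print(order)
# ['c2', 'c1', 'tree2', 'blobA2', 'sub1', 'blobB', 'tree1', 'blobA1']
```

Everything needed to check out the newest commit (`tree2`, `blobA2`, `sub1`, `blobB`) ends up contiguous, while objects only the older commit needs (`tree1`, `blobA1`) follow afterwards.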

The final object ordering is slightly affected by the delta base selection in that if an object is selected for delta representation and its base object has not been stored yet, then its base object is stored immediately before the deltified object itself. This prevents likely disk cache misses due to the non-linear access required to read a base object that would have "naturally" been stored later in the pack file.
