git binary diff 算法(增量存储)是否标准化? [英] Is the git binary diff algorithm (delta storage) standardized?

查看:31
本文介绍了git binary diff 算法(增量存储)是否标准化?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Git 使用增量压缩来存储彼此相似的对象.

该算法是否标准化并在其他工具中使用?是否有描述格式的文档?它是否与 xdelta/VCDIFF/RFC 3284 兼容?

解决方案

我认为 diff 算法用于 ,现在(2018 年第二季度)指出:<块引用>

对象类型

有效的对象类型是:

  • OBJ_COMMIT (1)
  • OBJ_TREE (2)
  • OBJ_BLOB (3)
  • OBJ_TAG (4)
  • OBJ_OFS_DELTA (6)
  • OBJ_REF_DELTA (7)

Type 5 保留用于未来扩展.类型 0 无效.

Deltified 表示

概念上只有四种对象类型:提交、树、标签和斑点.
但是为了节省空间,对象可以存储为另一个基础"对象.
这些表示被分配了新的 s-delta 和 ref-delta 类型,它们只在包文件中有效.

ofs-deltaref-delta 都存储要应用于的delta"另一个对象(称为基础对象")来重建对象.
它们之间的区别是,

  • ref-delta 直接编码 20 字节的基础对象名称.
    • 如果基础对象在同一个包中,ofs-delta 会改为编码包中基础对象的偏移量.

如果基础对象在同一个包中,它也可以被删除.
Ref-delta 也可以指包外的对象(即所谓的瘦身").然而,当存储在磁盘上时,包应该自包含以避免循环依赖.

delta 数据是用于重建对象的指令序列来自基础对象.
如果基础对象被删除,则必须先将其转换为规范形式.每条指令都会向目标对象追加越来越多的数据,直到完成为止.
目前有两个支持的指令:

  • 一个用于从源对象复制一个字节范围和
  • 一种用于插入嵌入指令本身的新数据.

每条指令都有可变长度.指令类型确定由第一个八位字节的第七位.下图如下RFC 1951 中的约定(Deflate 压缩数据格式).

Git uses a delta compression to store objects that are similar to each-other.

Is this algorithm standardized and used in other tools as well? Is there documentation describing the format? Is it compatible with xdelta/VCDIFF/RFC 3284?

解决方案

I think the diff algo used for pack files was linked to one of the delta encoding out there: initially (2005) xdelta, and then libXDiff.
But then, as detailed below, it shifted to a custom implementation.

Anyway, as mentioned here:

Git does deltification only in packfiles.
But when you push via SSH git would generate a pack file with commits the other side doesn't have, and those packs are thin packs, so they also have deltas... but the remote side then adds bases to those thin packs making them standalone.

(note: creating many packfiles, or retrieving information in huge packfile is costly, and explain why git doesn't handle well huge files or huge repo.
See more at "git with large files")

This thread also reminds us:

Actually packfiles and deltification (LibXDiff, not xdelta) was, from what I remember and understand, originally because of network bandwidth (which is much more costly than disk space), and I/O performance of using single mmapped file instead of very large number of loose objects.

LibXDiff is mentioned in this 2008 thread.

However, since then, the algo has evolved, probably in a custom one, as this 2011 thread illustrates, and as the header of diff-delta.c points out:

So, strictly speaking, the current code in Git doesn't bear any resemblance with the libxdiff code at all.
However the basic algorithm behind both implementations is the same
.
Studying the libxdiff version is probably easier in order to gain an understanding of how this works.

/*
 * diff-delta.c: generate a delta between two buffers
 *
 * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
 * http://www.xmailserver.org/xdiff-lib.html
 *
 * Rewritten for GIT by Nicolas Pitre <nico@fluxnic.net>, (C) 2005-2007
 */

More on the packfiles the Git Book:


Git 2.18 adds to the delta description in this new documentation section, which now (Q2 2018) states:

Object types

Valid object types are:

  • OBJ_COMMIT (1)
  • OBJ_TREE (2)
  • OBJ_BLOB (3)
  • OBJ_TAG (4)
  • OBJ_OFS_DELTA (6)
  • OBJ_REF_DELTA (7)

Type 5 is reserved for future expansion. Type 0 is invalid.

Deltified representation

Conceptually there are only four object types: commit, tree, tag and blob.
However to save space, an object could be stored as a "delta" of another "base" object.
These representations are assigned new types ofs-delta and ref-delta, which is only valid in a pack file.

Both ofs-delta and ref-delta store the "delta" to be applied to another object (called 'base object') to reconstruct the object.
The difference between them is,

  • ref-delta directly encodes 20-byte base object name.
    • If the base object is in the same pack, ofs-delta encodes the offset of the base object in the pack instead.

The base object could also be deltified if it's in the same pack.
Ref-delta can also refer to an object outside the pack (i.e. the so-called "thin pack"). When stored on disk however, the pack should be self contained to avoid cyclic dependency.

The delta data is a sequence of instructions to reconstruct an object from the base object.
If the base object is deltified, it must be converted to canonical form first. Each instruction appends more and more data to the target object until it's complete.
There are two supported instructions so far:

  • one for copy a byte range from the source object and
  • one for inserting new data embedded in the instruction itself.

Each instruction has variable length. Instruction type is determined by the seventh bit of the first octet. The following diagrams follow the convention in RFC 1951 (Deflate compressed data format).

这篇关于git binary diff 算法(增量存储)是否标准化?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆