How is the hash calculated for a commit vs. a tree vs. a blob?

Question

I am confused as to how SHA-1 hashes are calculated for commits, trees, and blobs. As per this article, commit hashes are calculated based on the following factors:

  1. The source tree of the commit (expanded into all its subtrees and blobs)
  2. The parent commit sha1
  3. The author info
  4. The committer info (yes, this can be different!)
  5. The commit message

Are the same factors involved for tree and blob hashes as well?

Answer

Git is sometimes called a "content-addressable filesystem". The hashes are the addresses, and they are based on the contents of the various objects. So, in order to know what the hash is based on, we only need to know the contents of the various objects.

A blob is simply a stream of octets. That's it. It is analogous to the contents of a file in a Unix filesystem.

So, the hash of a blob is based solely on its contents, a blob has no metadata.
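
To make that concrete, here is a minimal Python sketch, assuming the "blob <size>" plus NULL-octet header described further down in this answer; the point is that only the octets of the contents matter, never a file name or any other metadata:

import hashlib

def blob_hash(content: bytes) -> str:
    # Git hashes a small header ("blob <size in octets>" plus a NULL octet)
    # followed by the raw contents; nothing else enters the hash.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical contents give the identical hash, no matter which file they came from.
print(blob_hash(b"Hello, World"))
print(blob_hash(b"Hello, World"))
print(blob_hash(b""))   # the well-known hash of the empty blob

You can compare the output against what git hash-object --stdin prints when fed the same octets.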

A tree associates names and permissions with other objects (blobs or trees). A tree is simply a list of quadruples (permission, type, hash, name). For example, a tree may look like this:

100644 blob a906cb2a4a904a152e80877d4088654daad0c859 README
100644 blob 8f94139338f9404f26296befa88755fc2598c289 Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 lib

Note the third entry which is itself a tree.

A tree is analogous to a directory special file in a Unix filesystem.

Again, the hash is based on the contents of the tree, which means on the names, permissions, types, and hashes of its leaves.
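
A rough Python sketch of that, assuming the raw entry layout "<mode> <name>" plus a NULL octet plus the 20-byte binary hash, with entries sorted by name (the real rules are slightly stricter, e.g. directory modes are stored without the leading zero and subtree names sort as if they ended in "/"):

import hashlib

def tree_hash(entries):
    # entries: (mode, name, hex hash of the blob or subtree), as in the listing above
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_hash)
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()

# Renaming a file, changing a permission, or changing any leaf hash changes the tree hash.
print(tree_hash([
    ("100644", "README",   "a906cb2a4a904a152e80877d4088654daad0c859"),
    ("100644", "Rakefile", "8f94139338f9404f26296befa88755fc2598c289"),
    ("40000",  "lib",      "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
]))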

A commit records a snapshot of a tree at a point in time, together with some metadata and how the snapshot came to be. A commit consists of:

  • A list of the hashes of its parent commits (any number, including zero)
  • The hash of the tree
  • The commit message
  • Commit metadata (commit date and committer name)
  • Authoring metadata (authoring date and author name)

The hash of a commit is based on those.
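
As a hedged sketch, the serialized commit is itself a small text document, roughly along these lines (the real author and committer lines also carry e-mail addresses, timestamps and timezones; the values below are made up):

import hashlib

def commit_hash(tree_hex, parent_hexes, author, committer, message):
    lines = ["tree %s" % tree_hex]
    lines += ["parent %s" % p for p in parent_hexes]          # zero or more parents
    lines += ["author %s" % author, "committer %s" % committer, "", message]
    body = "\n".join(lines).encode()
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()

# Changing the tree, a parent, the message, or any of the metadata changes the hash.
print(commit_hash(
    "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0",               # hash of the tree
    [],                                                       # a root commit has no parent
    "A U Thor <author@example.com> 1700000000 +0000",
    "C O Mitter <committer@example.com> 1700000000 +0000",
    "Initial commit\n",
))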

Tags aren't objects in the sense above. They are not part of the object store and don't have a hash. They are references to objects. (Note: any object can be tagged, not just commits, although that is the normal use case.)

An annotated tag is different: it is part of the object store.

An annotated tag stores:

  • The hash of the commit being tagged
  • The tag message
  • Tag metadata (tag name and tag date)

As with all other objects, the hash is calculated based on all of them and nothing more.
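
A short sketch in the same spirit, assuming a textual tag layout analogous to a commit (object, type, tag name, tagger, blank line, message); the hash and names below are made up:

import hashlib

def tag_hash(object_hex, obj_type, tag_name, tagger, message):
    body = "\n".join(["object %s" % object_hex, "type %s" % obj_type,
                      "tag %s" % tag_name, "tagger %s" % tagger, "", message]).encode()
    return hashlib.sha1(b"tag %d\x00" % len(body) + body).hexdigest()

print(tag_hash("d670460b4b4aece5915caf5c68d12f560a9fe3e4", "commit", "v1.0",
               "A U Thor <author@example.com> 1700000000 +0000", "Release 1.0\n"))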

A signed tag is like an annotated tag, but adds a cryptographic signature.

Notes allow you to associate an arbitrary commit with an arbitrary Git object.

The storage of notes is a little more complicated. Actually, a note is just a commit (containing a tree containing blobs containing the contents of the note). Git creates a special branch for notes and the association between the note commit and its "annotee object" happens there. I am not familiar with exactly how.

However, since a note is just a commit, and the association happens externally, the hash of a note is just the same as any other commit.

The storage format contains a simple header. The content that is actually stored (and hashed) is the header followed by a NULL octet followed by the object contents.

The header contains the type and the length of the object contents, encoded in ASCII. So, the blob which contains the string Hello, World encoded in ASCII would look like this:

blob 12\0Hello, World

And that is what is hashed and stored.
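
You can reproduce that in a couple of lines of Python; the result should match what git hash-object --stdin prints for the same 12 octets:

import hashlib

store = b"blob 12\x00Hello, World"      # header, NULL octet, then the contents
print(hashlib.sha1(store).hexdigest())
# Compare with: printf 'Hello, World' | git hash-object --stdin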

Other types of objects have a more structured format, so a tree object would start off with a header tree <length of content in octets>\0 followed by a strictly defined, structured, serialized representation of a tree.

The same for commits, and so on.

Most formats are textual formats, based on simple ASCII. For example, the size is not encoded as a binary integer, but as a decimal integer with each digit represented as the corresponding ASCII character.

After the hash is computed, the octet stream corresponding to the object, including the header, is compressed using zlib-deflate, and the resulting octet stream is stored in a file whose location is based on the hash; by default in the directory

.git/objects/<first two characters of the hash>/<remaining hash>
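
A hedged Python sketch of that write path (the directory handling is simplified; real git also deals with file permissions, temporary files and atomic renames, and the usage line at the end is hypothetical):

import hashlib, os, zlib

def write_loose_object(git_dir, obj_type, content):
    store = b"%s %d\x00" % (obj_type.encode(), len(content)) + content
    hex_hash = hashlib.sha1(store).hexdigest()        # the hash is taken before compression
    path = os.path.join(git_dir, "objects", hex_hash[:2], hex_hash[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))                 # only the file on disk is deflated
    return hex_hash

# Hypothetical usage: write_loose_object("/tmp/repo/.git", "blob", b"Hello, World")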

Packs

The above storage format is called the loose object format, because every object is stored individually. There is a more efficient storage format (which is also used as the network transmission format), called a packfile.

Packfiles are an important speed and storage optimization, but they are rather complex, so I am not going to describe them in detail.

As a first approximation, a packfile consists of all the uncompressed objects concatenated into a single file, plus a second file that contains an index of where each object resides in the packfile. The packfile as a whole is then compressed, which allows a better compression ratio, since the algorithm can also find redundancies between objects and not just within a single object. (E.g. if you have two revisions of a blob which are almost identical … which is kind of the norm in an SCM.)

It doesn't use zlib-deflate, rather it uses a binary delta compression algorithm. It also uses certain heuristics for how to place the objects in the packfile so that objects which are likely to have large similarity are placed closely together. (The delta algorithm cannot actually see the whole packfile at once, that would consume too much memory, rather it operates on a sliding window over the packfile; the heuristics try to ensure that similar objects land within the same window.) Some of those heuristics are: look at the names a tree associates with blobs and try to keep the ones with the same names close together, try to keep the ones with the same file extension close together, try to keep subsequent revisions close together and so on.

Loose (i.e. not packed) objects are just zlib-deflated. Un-deflate them and just look at them to see how they are structured. Note that the uncompressed octet stream is exactly what is being hashed; the objects are stored compressed, but hashed before they are compressed.

Here is a simple Perl one-liner to uncompress (inflate?) the stream:

perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
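
And an equivalent hedged sketch in Python, reading one loose object and splitting off its header (the path is just an example; substitute the hash of an object from your own repository):

import zlib

with open(".git/objects/a9/06cb2a4a904a152e80877d4088654daad0c859", "rb") as f:
    raw = zlib.decompress(f.read())
header, _, body = raw.partition(b"\x00")
print(header)         # e.g. b'blob 1234'
print(body[:80])      # the first octets of the object contents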
