Why does git store objects in directories with the first two characters of the hash?


Question

I'm designing a directory structure based on UUIDs so I'm looking at what git does to see if it would be a good model.

I can see that git stores objects in a structure where the first two characters of the hash are used as a directory and the rest of the hash is the file name.

What I'm wondering is why. If there's a big advantage to using directories, why aren't more subdirectories created, say one directory level for each one or two characters of the hash, forming a tree? If there isn't a big advantage, then why have the directory for the first two characters at all?

Answer

Git switches from "loose objects" (stored in files named like 01/23456789abcdef0123456789abcdef01234567) to "packs" when the number of loose objects exceeds a magic constant (6700 by default, configurable via gc.auto). Since SHA-1 values tend to be well-distributed, git can approximate the total number of loose objects by looking in a single directory. If there are more than (6700 + 255) / 256 = 27 files (integer division) in one of the object directories, it's time for a pack-file.
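The sampling heuristic described above can be sketched in Python. This is only an illustration under stated assumptions, not git's actual code (git is written in C): the function names, the choice of "01" as the sample directory, and the assumption of SHA-1 names (a 2-character directory plus a 38-character file name) are all hypothetical choices made for the example.

```python
import os
import re

GC_AUTO_DEFAULT = 6700  # default gc.auto value quoted in the answer
FANOUT = 256            # one directory per two-hex-character prefix

# Assumption: SHA-1 object names, so a loose object file name is 38 hex chars.
_LOOSE_NAME = re.compile(r"[0-9a-f]{38}")


def per_directory_limit(gc_auto: int = GC_AUTO_DEFAULT, fanout: int = FANOUT) -> int:
    """Round-up integer division, i.e. (6700 + 255) // 256 == 27 by default."""
    return (gc_auto + fanout - 1) // fanout


def too_many_loose_objects(objects_dir: str, sample: str = "01",
                           gc_auto: int = GC_AUTO_DEFAULT) -> bool:
    """Estimate whether the repository is over the threshold by counting
    loose objects in ONE fan-out directory instead of all 256."""
    sample_dir = os.path.join(objects_dir, sample)
    try:
        names = os.listdir(sample_dir)
    except FileNotFoundError:
        return False  # no such directory means no loose objects in the sample
    count = sum(1 for name in names if _LOOSE_NAME.fullmatch(name))
    return count > per_directory_limit(gc_auto)
```

A call such as `too_many_loose_objects(".git/objects")` would then read a single small directory rather than walking the whole object store, which is the point of the approximation.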

Thus, there's no need for additional fan-out (01/23/4567...): it's unlikely that you will ever get that many objects in one directory. In fact, greater fan-out would make it harder to detect that it is time for an automatic packing, unless you set the threshold much higher than 6700. With a second level of 256 directories, the per-directory limit becomes (27 + 255) / 256 = 1, which is too small to sample reliably, so you'd want to count everything in 01/*/ instead of just 01/.
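To make the fan-out arithmetic concrete, here is a small sketch that computes the per-directory limit for one, two, and four leading hex characters. The helper names are hypothetical, and the 6700 figure is the default quoted in this answer.

```python
GC_AUTO_DEFAULT = 6700  # default gc.auto value quoted in the answer


def leaf_directories(hex_chars: int) -> int:
    """Number of leaf directories when the first hex_chars characters
    of the hash are used for directory names (16 values per character)."""
    return 16 ** hex_chars


def per_directory_limit(total: int, directories: int) -> int:
    """Round-up integer division of the total threshold across directories."""
    return (total + directories - 1) // directories


def fanout_table(total: int = GC_AUTO_DEFAULT):
    """Rows of (hex chars used, leaf directories, per-directory limit)."""
    rows = []
    for hex_chars in (1, 2, 4):
        dirs = leaf_directories(hex_chars)
        rows.append((hex_chars, dirs, per_directory_limit(total, dirs)))
    return rows


for hex_chars, dirs, limit in fanout_table():
    print(f"{hex_chars} hex chars: {dirs} directories, limit {limit} per directory")
```

With the default threshold this gives limits of 419, 27, and 1. At a limit of 1, a single stray object in the sampled leaf directory decides the outcome, which is why a deeper layout would need to count across 01/*/ instead.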

One could use 0/1234567... and allow up to ~419 objects per directory to get the same behavior, but linear directory scans (on any system that still uses those) are O(n²), and 27² is a mere 729, while 419² is 175561. [Edit: that only applies to file creation, where you have a two-stage search: once to find that it's OK to create the file, and a second time to find a slot or append. Lookups are still O(n).]
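The quadratic figure can be illustrated with a toy model. This is an assumption-laden sketch, not a measurement of any real file system: it supposes that every file creation makes two full passes over the entries already in the directory (the two-stage search mentioned in the edit), and that a lookup makes one pass. The function names are hypothetical.

```python
def entry_visits_to_fill(n: int, passes_per_create: int = 2) -> int:
    """Total directory entries visited while creating n files one at a time
    in a directory that is searched linearly (toy model)."""
    total = 0
    for existing in range(n):  # 'existing' entries are present before this create
        total += passes_per_create * existing
    return total


def entry_visits_for_lookup(n: int) -> int:
    """Worst case for a single lookup: one pass over all n entries."""
    return n


for n in (27, 419):
    print(n, entry_visits_to_fill(n), n * n, entry_visits_for_lookup(n))
```

Under this model, filling a directory to 27 entries visits 702 entries in total and filling it to 419 visits 175142, the same order of magnitude as the 729 and 175561 quoted above, while a single lookup stays linear.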
