Git hash duplicates

Question

Git lets you retrieve the hash of a commit with commands like:

git rev-parse --short HEAD

which gives 33b316c, or

git rev-parse HEAD

which gives 33b316cbeeab3d69e79b9fb659414af4e7829a32. I know that, in practice, the long hashes will never collide.

In practice, the short hashes are used much more often. What is the probability that two short hashes collide? Does Git take any measures to cope with possible collisions (for example, when using git checkout)?

Recommended answer

I give a formula in my book (see pp. 78-79), but if you're looking for a simple rule: the probability of some collision in an n-bit hash reaches about 50% once you have hashed roughly 2^(n/2) keys. A SHA-1 hash is 160 bits, written as 40 hexadecimal digits, each of which represents 4 of the 160 bits. Truncating to 7 hexadecimal digits leaves 28 bits, so you reach the 50%-chance-of-collision point at about 2^14 keys, or 16384 objects. If the objects were only commits, that would be a pretty decent number of commits, but Git places all objects (commits, trees, annotated tag objects, and blobs) in a single hash-indexed key-value store.
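The 2^(n/2) rule of thumb above can be tabulated for a few abbreviation lengths. This is only a sketch: the function name is made up for illustration, and the rule itself is approximate (at exactly 2^(n/2) keys the standard birthday estimate is somewhat under 50%).

```python
def keys_for_even_odds(hex_digits: int) -> int:
    """Rule-of-thumb number of keys at which a collision is roughly a coin flip.

    Each hex digit carries 4 bits, so an abbreviation of `hex_digits`
    characters is an n-bit hash with n = 4 * hex_digits, and the rule of
    thumb is 2^(n/2) keys.
    """
    bits = 4 * hex_digits
    return 2 ** (bits // 2)  # bits is always even, so this is exact


if __name__ == "__main__":
    for digits in (7, 8, 10, 12, 40):
        print(f"{digits:2d} hex digits -> about {keys_for_even_odds(digits):,} objects")
```

For 7 hex digits this reproduces the 16384-object figure quoted above.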

The probability that the hashes of any one given pair of keys collide is just 1 in 2^n, i.e., 1 in 2^28, or about 1 in 268 million. The reason the overall probability climbs so quickly toward 50% as the number of keys grows is known as the Birthday Paradox, or birthday problem. 50% is of course far too scary; with 28 bits, if we want the overall probability to stay below 0.1%, we should keep the number of objects below about 730. Going to 32 bits (8-character abbreviations) raises this roughly fourfold, to about 2,930, but that's still not very many objects.
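To experiment with other thresholds, the birthday-problem estimate can be written down directly. The sketch below assumes the usual approximation p ≈ 1 - exp(-k(k-1)/(2N)) for k objects spread uniformly over N = 16^d possible abbreviations; both function names are hypothetical, and this may differ slightly from the formula in the author's book.

```python
import math


def collision_probability(num_objects: int, hex_digits: int) -> float:
    """Approximate chance that at least two of `num_objects` objects
    share the same `hex_digits`-character abbreviation."""
    space = 16 ** hex_digits
    exponent = num_objects * (num_objects - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), stable for tiny values


def max_objects(target_probability: float, hex_digits: int) -> int:
    """Approximate largest object count that keeps the collision chance
    at or below `target_probability` (inverts the estimate above)."""
    space = 16 ** hex_digits
    return math.floor(math.sqrt(2 * space * -math.log1p(-target_probability)))


if __name__ == "__main__":
    for digits in (7, 8, 10):
        print(digits, max_objects(0.001, digits), max_objects(0.5, digits))
```

Each extra hex digit multiplies the space by 16, so the object count allowed for a fixed probability grows by a factor of 4 per digit.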

By the time you have 16k objects in your store, you should probably use at least 10 hexadecimal digits, giving 2^40 possible hash values and a p-bar value (the probability of no collision) of about 0.99987794, i.e., about a 0.012% chance of collision. Nine hex digits give only 2^36 hash values, producing a p-bar of about 0.99804890, or about a 0.2% chance of collision, which I think is too high.
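The p-bar figures quoted above can be checked numerically. This is a minimal sketch that assumes p-bar means the probability that no two objects share an abbreviation, computed as the exact product of (1 - i/N) over the objects; the function name is chosen here for illustration.

```python
import math


def p_bar(num_objects: int, hex_digits: int) -> float:
    """Probability that `num_objects` uniformly hashed objects all get
    distinct `hex_digits`-character abbreviations."""
    space = 16 ** hex_digits
    # Sum logarithms instead of multiplying, to avoid losing precision
    # when every factor is extremely close to 1.
    log_p = sum(math.log1p(-i / space) for i in range(num_objects))
    return math.exp(log_p)


if __name__ == "__main__":
    for digits in (9, 10):
        no_collision = p_bar(16384, digits)
        print(f"{digits} hex digits: p-bar = {no_collision:.8f}, "
              f"collision chance = {100 * (1 - no_collision):.4f}%")
```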

If you can restrict your ambiguous-matching code to only commits (or only commit-ish, which in Git means commits or annotated tags), the built-in defaults work pretty well. (Git will in fact do this in a lot of cases.) But Git's internal code for computing the "right" abbreviation length is, at least in my opinion, far too care-free, too "loosey-goosey", because it uses the 50%-collision-probability square-root trick in contexts where the resulting hash might be used to identify any object.

(As noted in comments, Git always uses full hashes internally. Abbreviated hashes appear only at the boundary between Git and the outside world, i.e., in user-facing commands such as git log <hash> or git show <hash>, where you can type in an abbreviated hash or ask for abbreviated hashes in the output. For output, Git by default uses the 50%-collision-probability figure to compute how many characters to show, starting from an estimate of the number of objects in the database. If you're supplying the hash, you choose how many characters to supply. If you're asking Git to provide it, you can still choose how many, with --abbrev=number. Note that there is an absolute minimum of 4: git log abc won't treat abc as a hash ID, but git log abcd will treat abcd as an abbreviated hash ID. There is also a very old default of 7 characters, from the Git 1.7-ish days.)
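As an illustration of the square-root rule described above, the following sketch picks an abbreviation length from an object count. It is not Git's actual source code: the function name, the use of the count's bit length as the size estimate, and the floor of 7 (taken from the old default mentioned above) are assumptions made for this example.

```python
def suggested_abbrev_length(object_count: int, minimum: int = 7, full: int = 40) -> int:
    """Pick a hex-digit count so that the 50% collision point, 2^(n/2),
    lies at or above the number of objects (hypothetical helper)."""
    # object_count < 2**count_bits, so we want n/2 >= count_bits,
    # i.e. n >= 2 * count_bits bits of hash.
    count_bits = max(object_count, 1).bit_length()
    # 4 bits per hex digit: ceil(2 * count_bits / 4) == ceil(count_bits / 2).
    hex_digits = (count_bits + 1) // 2
    return min(max(hex_digits, minimum), full)


if __name__ == "__main__":
    for count in (1_000, 16_384, 1_000_000, 10_000_000):
        print(f"{count:>10,} objects -> {suggested_abbrev_length(count)} hex digits")
```

Because this rule targets the 50% point, a caller who wants a much smaller collision chance would need to add several digits on top of the result.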
