Git is moving to the new hashing algorithm SHA-256, but why did the Git community settle on SHA-256?

Problem description

I just learned from this HN post that Git is moving to a new hashing algorithm (from SHA-1 to SHA-256).

I wanted to know what makes SHA-256 the best fit for Git's use case. Is there any strong technical reason, or could SHA-256's popularity be a significant factor? (I am guessing.) Looking at the https://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions page, I see there are many modern and older alternatives. Some of them are as performant (if not more so) and stronger than SHA-256 (for example https://crypto.stackexchange.com/q/26336).

Solution

I have presented that move in "Why doesn't Git use more modern SHA?" in Aug. 2018

The reasons were discussed here by Brian M. Carlson:

I've implemented and tested the following algorithms, all of which are 256-bit (in alphabetical order):

  • BLAKE2b (libb2)
  • BLAKE2bp (libb2)
  • KangarooTwelve (imported from the Keccak Code Package)
  • SHA-256 (OpenSSL)
  • SHA-512/256 (OpenSSL)
  • SHA3-256 (OpenSSL)
  • SHAKE128 (OpenSSL)

I also rejected some other candidates.
I couldn't find any reference or implementation of SHA256×16, so I didn't implement it.
I didn't consider SHAKE256 because it is nearly identical to SHA3-256 in almost all characteristics (including performance).
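For reference, the four OpenSSL-backed candidates in that list can be exercised directly through OpenSSL's EVP interface (1.1.1 or later). This is a minimal sketch, not code from the original tests, that prints a 256-bit digest with each of them:

```c
#include <stdio.h>
#include <openssl/evp.h>

static void print_digest(const char *label, const unsigned char *md, unsigned int len)
{
    printf("%-12s ", label);
    for (unsigned int i = 0; i < len; i++)
        printf("%02x", md[i]);
    printf("\n");
}

int main(void)
{
    const unsigned char msg[] = "hello, git";
    size_t msglen = sizeof(msg) - 1;
    unsigned char md[32];
    unsigned int mdlen;

    /* Fixed-output 256-bit digests: one EVP_Digest() call each. */
    const struct { const char *name; const EVP_MD *(*algo)(void); } fixed[] = {
        { "SHA-256",     EVP_sha256 },
        { "SHA-512/256", EVP_sha512_256 },
        { "SHA3-256",    EVP_sha3_256 },
    };
    for (size_t i = 0; i < sizeof(fixed) / sizeof(fixed[0]); i++) {
        EVP_Digest(msg, msglen, md, &mdlen, fixed[i].algo(), NULL);
        print_digest(fixed[i].name, md, mdlen);
    }

    /* SHAKE128 is an extendable-output function (XOF), so the
     * 256-bit output length must be requested explicitly. */
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_shake128(), NULL);
    EVP_DigestUpdate(ctx, msg, msglen);
    EVP_DigestFinalXOF(ctx, md, sizeof(md));
    print_digest("SHAKE128", md, sizeof(md));
    EVP_MD_CTX_free(ctx);
    return 0;
}
```

Compile with `cc demo.c -lcrypto`. BLAKE2b and KangarooTwelve came from libb2 and the Keccak Code Package in the tests above, so they are left out of this sketch.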

SHA-256 and SHA-512/256

These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in size.

I noted the following benefits:

  • Both algorithms are well known and heavily analyzed.
  • Both algorithms provide 256-bit preimage resistance.

Summary

The algorithms with the greatest implementation availability are SHA-256, SHA3-256, BLAKE2b, and SHAKE128.

In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256, and SHA3-256 should be available in the near future on a reasonably small Debian, Ubuntu, or Fedora install.

As far as security, the most conservative choices appear to be SHA-256, SHA-512/256, and SHA3-256.

The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated.

The suggested conclusion was based on:

Popularity

Other things being equal we should be biased towards whatever's in the widest use & recommended for new projects.

Hardware acceleration

The only widely deployed HW acceleration is for the SHA-1 and SHA-256 from the SHA-2 family, but notably nothing from the newer SHA-3 family (released in 2015).

Age

Similar to "popularity" it seems better to bias things towards a hash that's been out there for a while, i.e. it would be too early to pick SHA-3.

The hash transitioning plan, once implemented, also makes it easier to switch to something else in the future, so we shouldn't be in a rush to pick some newer hash because we'll need to keep it forever, we can always do another transition in another 10-15 years.

Result: commit 0ed8d8d, Git v2.19.0-rc0, Aug 4, 2018.

SHA-256 has a number of advantages:

  • It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc).

  • When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration.

  • If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms, either of which alone could break the security, when we could just depend on one.

So SHA-256 it is.

The idea remains: Any notion of SHA1 is being removed from Git codebase and replaced by a generic "hash" variable.
Tomorrow, that hash will be SHA2, but the code will support other hashes in the future.
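Schematically, the abstraction is a table of algorithm descriptors that call sites index instead of hard-coding SHA-1. Below is a simplified sketch in the spirit of Git's hash.h; the field names only approximate the real struct git_hash_algo, which also carries a format id, block size, and init/update/final function pointers:

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative sketch: every call site goes through an algorithm
 * descriptor instead of hard-coding SHA-1. */
struct hash_algo {
    const char *name;  /* "sha1" or "sha256" */
    size_t rawsz;      /* raw digest size in bytes: 20 or 32 */
    size_t hexsz;      /* hex digest length: 40 or 64 */
};

static const struct hash_algo hash_algos[] = {
    { "sha1",   20, 40 },
    { "sha256", 32, 64 },
};

int main(void)
{
    /* "The hash" is just a row in this table; adding a future
     * algorithm means adding a row, not auditing every caller. */
    for (size_t i = 0; i < sizeof(hash_algos) / sizeof(hash_algos[0]); i++)
        printf("%-7s raw=%zu bytes, hex=%zu chars\n",
               hash_algos[i].name, hash_algos[i].rawsz, hash_algos[i].hexsz);
    return 0;
}
```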

As Linus Torvalds delicately puts it (emphasis mine):

Honestly, the number of particles in the observable universe is on the order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be.
Make an informed technical decision, and say "256 bits is a lot".

The difference between engineering and theory is that engineering makes trade-offs.
Good software is well engineered, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that nobody actually sees the new bits by default.
So the perl scripts etc that use "[0-9a-f]{40}" as a hash pattern would just silently continue to work.

Because backwards compatibility is important (*)

(*) And 2**160 is still a big big number, and hasn't really been a practical problem, and SHA1DC is likely a good hash for the next decade or longer.

(SHA1DC, for "Detecting(?) Collision", was discussed in early 2017, after the collision attack shattered.io instance: see commit 28dc98e, Git v2.13.0-rc0, March 2017, from Jeff King, and "Hash collision in git")


See more in Documentation/technical/hash-function-transition.txt

The transition to SHA-256 can be done one local repository at a time.

a. Requiring no action by any other party.
b. A SHA-256 repository can communicate with SHA-1 Git servers (push/fetch).
c. Users can use SHA-1 and SHA-256 identifiers for objects interchangeably (see "Object names on the command line", below).
d. New signed objects make use of a stronger hash function than SHA-1 for their security guarantees.


That transition is facilitated with Git 2.27 (Q2 2020) and its git fast-import --rewrite-submodules-from/to=<name>:<file>.

See commit 1bdca81, commit d9db599, commit 11d8ef3, commit abe0cc5, commit ddddf8d, commit 42d4e1d, commit e02a714, commit efa7ae3, commit 3c9331a, commit 8b8f718, commit cfe3917, commit bf154a8, commit 8dca7f3, commit 6946e52, commit 8bd5a29, commit 1f5f8f3, commit 192b517, commit 9412759, commit 61e2a70, commit dadacf1, commit 768e30e, commit 2078991 (22 Feb 2020) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit f8cb64e, 27 Mar 2020)

fast-import: add options for rewriting submodules

Signed-off-by: brian m. carlson

When converting a repository using submodules from one hash algorithm to another, it is necessary to rewrite the submodules from the old algorithm to the new algorithm, since only references to submodules, not their contents, are written to the fast-export stream.
Without rewriting the submodules, fast-import fails with an "Invalid dataref" error when encountering a submodule in another algorithm.

Add a pair of options, --rewrite-submodules-from and --rewrite-submodules-to, that take a list of marks produced by fast-export and fast-import, respectively, when processing the submodule.
Use these marks to map the submodule commits from the old algorithm to the new algorithm.

We read marks into two corresponding struct mark_set objects and then perform a mapping from the old to the new using a hash table. This lets us reuse the same mark parsing code that is used elsewhere and allows us to efficiently read and match marks based on their ID, since mark files need not be sorted.

Note that because we're using a khash table for the object IDs, and this table copies values of struct object_id instead of taking references to them, it's necessary to zero the struct object_id values that we use to insert and look up in the table. Otherwise, we would end up with SHA-1 values that don't match because of whatever stack garbage might be left in the unused area.
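The zeroing issue is easy to reproduce outside Git: if a buffer sized for the largest hash is only partially filled by a SHA-1 value, comparing or hashing the whole buffer picks up the uninitialized tail. A standalone sketch, where the struct and macro are stand-ins for Git's struct object_id and GIT_MAX_RAWSZ:

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for Git's struct object_id: the buffer is sized for the
 * largest supported hash (GIT_MAX_RAWSZ is 32 in Git), but a SHA-1
 * value only fills the first 20 bytes. */
#define MAX_RAWSZ 32
struct oid_demo {
    unsigned char hash[MAX_RAWSZ];
};

int main(void)
{
    struct oid_demo a, b;

    /* Store the same 20-byte SHA-1 value in both, but simulate
     * different leftover stack garbage in the unused 12-byte tail. */
    memset(a.hash, 0xaa, 20);
    memset(a.hash + 20, 0x01, MAX_RAWSZ - 20);
    memset(b.hash, 0xaa, 20);
    memset(b.hash + 20, 0x02, MAX_RAWSZ - 20);

    /* A table that hashes and compares the whole buffer sees two
     * "different" keys even though the SHA-1 values are equal: */
    printf("uninitialized tail: %s\n",
           memcmp(&a, &b, sizeof(a)) ? "keys differ" : "keys equal");

    /* The fix described above: zero the values before filling them. */
    memset(&a, 0, sizeof(a));
    memset(&b, 0, sizeof(b));
    memset(a.hash, 0xaa, 20);
    memset(b.hash, 0xaa, 20);
    printf("zeroed first:       %s\n",
           memcmp(&a, &b, sizeof(a)) ? "keys differ" : "keys equal");
    return 0;
}
```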

The git fast-import documentation now includes:

Submodule Rewriting

--rewrite-submodules-from=<name>:<file>
--rewrite-submodules-to=<name>:<file>

Rewrite the object IDs for the submodule specified by <name> from the values used in the from <file> to those used in the to <file>.
The from marks should have been created by git fast-export, and the to marks should have been created by git fast-import when importing that same submodule.

<name> may be any arbitrary string not containing a colon character, but the same value must be used with both options when specifying corresponding marks.
Multiple submodules may be specified with different values for <name>. It is an error not to use these options in corresponding pairs.

These options are primarily useful when converting a repository from one hash algorithm to another; without them, fast-import will fail if it encounters a submodule because it has no way of writing the object ID into the new hash algorithm.

And:

commit: use expected signature header for SHA-256

Signed-off-by: brian m. carlson

The transition plan anticipates that we will allow signatures using multiple algorithms in a single commit.
In order to do so, we need to use a different header per algorithm so that it will be obvious over which data to compute the signature.

The transition plan specifies that we should use "gpgsig-sha256", so wire up the commit code such that it can write and parse the current algorithm, and it can remove the headers for any algorithm when creating a new commit.
Add tests to ensure that we write using the right header and that git fsck doesn't reject these commits.
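Concretely, this means a commit signed under SHA-256 carries its signature in a gpgsig-sha256 header instead of the classic gpgsig one, with continuation lines of the header indented by one space as usual for commit objects. Schematically, with placeholder object IDs:

```
tree <64-hex tree object ID>
parent <64-hex parent object ID>
author A U Thor <author@example.com> 1585000000 +0000
committer A U Thor <author@example.com> 1585000000 +0000
gpgsig-sha256 -----BEGIN PGP SIGNATURE-----
 <base64 signature data, one leading space per continuation line>
 -----END PGP SIGNATURE-----

Commit message goes here.
```

The distinct header name is what lets a future mixed-algorithm commit carry one signature per algorithm without ambiguity about which bytes each signature covers.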


Note: that last fast-import evolution had a nasty side-effect: "git fast-import"(man) wasted a lot of memory when many marks were in use.
That should be fixed with Git 2.30 (Q1 2021).

See commit 3f018ec (15 Oct 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit cd47bbe, 02 Nov 2020)

fast-import: fix over-allocation of marks storage

Reported-by: Sergey Brester
Signed-off-by: Jeff King

Fast-import stores its marks in a trie-like structure made of mark_set structs.
(Trie: digital tree)
Each struct has a fixed size (1024). If our id number is too large to fit in the struct, then we allocate a new struct which shifts the id number by 10 bits. Our original struct becomes a child node of this new layer, and the new struct becomes the top level of the trie.

This scheme was broken by ddddf8d7e2 ("fast-import: permit reading multiple marks files", 2020-02-22, Git v2.27.0-rc0 -- merge listed in batch #2). Before then, we had a top-level "marks" pointer, and the push-down worked by assigning the new top-level struct to "marks". But after that commit, insert_mark() takes a pointer to the mark_set, rather than using the global "marks". It continued to assign to the global "marks" variable during the push down, which was wrong for two reasons:

  • we added a call in option_rewrite_submodules() which uses a separate mark set; pushing down on "marks" is outright wrong here. We'd corrupt the "marks" set, and we'd fail to correctly store any submodule mappings with an id over 1024.
  • the other callers passed "marks", but the push-down was still wrong. In read_mark_file(), we take the pointer to the mark_set as a parameter. So even though insert_mark() was updating the global "marks", the local pointer we had in read_mark_file() was not updated. As a result, we'd add a new level when needed, but then the next call to insert_mark() wouldn't see it! It would then allocate a new layer, which would also not be seen, and so on. Lookups for the lost layers obviously wouldn't work, but before we even hit any lookup stage, we'd generally run out of memory and die.

Our tests didn't notice either of these cases because they didn't have enough marks to trigger the push-down behavior. The new tests in t9304 cover both cases (and fail without this patch).

We can solve the problem by having insert_mark() take a pointer-to-pointer of the top-level of the set. Then our push down can assign to it in a way that the caller actually sees. Note the subtle reordering in option_rewrite_submodules(). Our call to read_mark_file() may modify our top-level set pointer, so we have to wait until after it returns to assign its value into the string_list.
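Here is a condensed model of the trie and of the pointer-to-pointer fix, with simplified types and names (the real fast-import code stores marks rather than void pointers, handles allocation failures, and differs in detail):

```c
#include <stdlib.h>

/* Condensed model of fast-import's mark storage. Each level holds
 * 1024 slots; wider ids are reached through child pointers,
 * 10 bits per level. */
#define SET_SHIFT 10
#define SET_SIZE  (1 << SET_SHIFT)   /* 1024 */

struct mark_set {
    unsigned shift;                  /* 0 at the leaf level */
    union {
        void *data[SET_SIZE];        /* leaf entries (shift == 0) */
        struct mark_set *sets[SET_SIZE];
    } u;
};

/* The fix: take a pointer-to-pointer to the top level, so that a
 * push-down replaces the root in a way every caller actually sees. */
static void insert_mark(struct mark_set **top, unsigned long idnum, void *value)
{
    struct mark_set *s = *top;

    /* Push down: grow the trie until idnum fits under the root. */
    while (idnum >= ((unsigned long)SET_SIZE << s->shift)) {
        struct mark_set *n = calloc(1, sizeof(*n));
        n->shift = s->shift + SET_SHIFT;
        n->u.sets[0] = s;            /* old root becomes child 0 */
        *top = s = n;                /* visible to the caller */
    }

    /* Walk down to the leaf, allocating intermediate levels. */
    while (s->shift) {
        unsigned long i = idnum >> s->shift;
        idnum -= i << s->shift;
        if (!s->u.sets[i]) {
            s->u.sets[i] = calloc(1, sizeof(struct mark_set));
            s->u.sets[i]->shift = s->shift - SET_SHIFT;
        }
        s = s->u.sets[i];
    }
    s->u.data[idnum] = value;
}

int main(void)
{
    struct mark_set *marks = calloc(1, sizeof(*marks));
    insert_mark(&marks, 5, (void *)"small id");
    /* id 2^20 forces two push-downs; with the buggy code, a second
     * mark set (or a caller's local pointer) lost the new root. */
    insert_mark(&marks, 1UL << 20, (void *)"large id");
    return 0;
}
```

The buggy version assigned the new root to the global "marks" variable instead of through *top, which is exactly the mismatch the two bullet points above describe.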
