Probability of collision with truncated SHA-256 hash


Problem description

I have a database-driven web application where the primary keys of all data rows are obfuscated as follows: SHA256(content type + primary key + secret), truncated to the first 8 characters. The content type is a simple word, e.g. "post" or "message" and the secret is a 20-30 char ASCII constant. The result is stored in a separate indexed column for fast DB lookup.

How do I calculate the probability of a hash collision in this scenario? I am not a mathematician at all, but a friend claimed that due to the Birthday Paradox the collision probability would be ~1% for 10,000 rows with an 8-char truncation. Is there any truth to this claim?

Solution

Yes, there is a collision probability & it's probably somewhat too high. The exact probability depends on what "8 characters" means.

Does "8 characters" mean:

  • A) You store 8 hex characters of the hash? That would store 32 bits.
  • B) You store 8 characters of BASE-64? That would store 48 bits.
  • C) You store 8 bytes, encoded in some single-byte charset/ or hacked in some broken way into a character encoding? That would store 56-64 bits, but if you don't do encoding right you'll encounter character conversion problems.
  • D) You store 8 bytes, as bytes? That genuinely stores 64 bits of the hash.

Storing binary data as either A) hex or D) binary bytes, would be my preferred options. But I'd definitely recommend either reconsidering your "key obfuscation" scheme or significantly expanding the stored key-size to reduce the (currently excessive) probability of key collision.
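To make the difference concrete, here is a minimal Python sketch of options A and D using only hashlib (the content type, key, and secret values are made up for illustration; this is not the asker's actual code):

```python
import hashlib

def obfuscated_key_hex(content_type: str, primary_key: int, secret: str) -> str:
    # Option A: keep the first 8 hex characters -> only 32 bits of the hash survive.
    digest = hashlib.sha256(f"{content_type}{primary_key}{secret}".encode()).hexdigest()
    return digest[:8]

def obfuscated_key_bytes(content_type: str, primary_key: int, secret: str) -> bytes:
    # Option D: keep the first 8 raw bytes -> 64 bits of the hash survive.
    digest = hashlib.sha256(f"{content_type}{primary_key}{secret}".encode()).digest()
    return digest[:8]

print(obfuscated_key_hex("post", 42, "some-20-to-30-char-secret"))          # 8 hex chars == 32 bits
print(obfuscated_key_bytes("post", 42, "some-20-to-30-char-secret").hex())  # 8 bytes == 64 bits
```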

From Wikipedia: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2).

In the most conservative reading of your design above (option A, 8 chars of hex == 32 bits), your scheme would be expected to suffer collisions once it stores on the scale of ~64,000 rows. I would consider such an outcome unacceptable for any serious, or even toy, system.
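As a quick check on the friend's ~1% claim, the standard birthday approximation p ≈ 1 − exp(−n(n−1) / (2·2^N)) can be evaluated directly; a small sketch of the arithmetic (not part of the original answer):

```python
import math

def collision_probability(rows: int, hash_bits: int) -> float:
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))
    return 1.0 - math.exp(-rows * (rows - 1) / (2.0 * 2 ** hash_bits))

print(collision_probability(10_000, 32))  # ~0.0116 -> roughly 1%, so the friend's estimate holds for 32 bits
print(collision_probability(65_536, 32))  # ~0.39   -> a collision becomes likely around 2^16 rows
print(collision_probability(10_000, 64))  # ~2.7e-12 if all 64 bits of the hash were stored
```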

Transaction tables, allowing for growth of the business, may see volumes from 1,000 to 100,000 transactions/day (or more). Systems should be designed to function for 100 years (36,500 days), with a 10x growth factor built in, so..

For your keying mechanism to be genuinely robust & professionally useful, you would need to be able to scale it up to potentially handle ~36 billion (2^35) rows without collision. That would imply 70+ bits of hash.
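Inverting the same birthday approximation (p ≈ n² / 2^(N+1)) gives a rough sizing rule for the hash width; a small sketch under that approximation, with illustrative thresholds:

```python
import math

def bits_needed(rows: int, max_probability: float) -> int:
    # Invert p ~= n^2 / 2^(bits+1): smallest hash width keeping the
    # collision probability under max_probability for the given row count.
    return math.ceil(math.log2(rows * rows / (2 * max_probability)))

print(bits_needed(2 ** 35, 0.5))    # 70 bits just to stay at a coin-flip collision chance
print(bits_needed(2 ** 35, 0.001))  # 79 bits to keep the chance around 0.1%
```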

The source-control system Git, for example, stores 160 bits of SHA-1 hash (40 chars of hex == 20 bytes or 160 bits). Collisions would not be expected to be probable with less than 2^80 different file revisions stored.


A possibly better design might be, rather than hashing & pseudo-randomizing the key entirely & hoping (against hope) to avoid collisions, to prepend/ append/ fold-in 8-10 bits of a hash into the key.

This would generate a larger key, containing all the uniqueness of the original key plus 8-10 bits of verification. Attempts to access keys would then be verified, and more than 3 invalid requests would be treated as an attempt to violate security by "probing" the keyspace & would trigger a semi-permanent lockout.

The only major cost here would be a modest reduction in the size of the available keyspace for a given int size. A 32-bit int to/from the browser would have 8-10 bits dedicated to security, thus leaving 22-24 bits for the actual key. So you'd use 64-bit ints where that was not sufficient.
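A minimal sketch of that "fold in verification bits" idea, assuming a 24-bit real key, 8 HMAC-derived check bits, and Python's hmac/hashlib (the secret, bit widths, and function names are illustrative, not from the original answer); the 3-strikes lockout would be enforced by the request handler around verify():

```python
import hmac
import hashlib

SECRET = b"server-side-secret"   # hypothetical; stays on the server
CHECK_BITS = 8                   # 8-10 verification bits folded into the key
KEY_BITS = 24                    # leaves 24 bits of a 32-bit public id for the real key

def _check(primary_key: int) -> int:
    # Derive CHECK_BITS verification bits from an HMAC of the real key.
    mac = hmac.new(SECRET, str(primary_key).encode(), hashlib.sha256).digest()
    return mac[0] & ((1 << CHECK_BITS) - 1)

def public_id(primary_key: int) -> int:
    # Fold the check bits in: high bits = verification, low bits = real key.
    assert primary_key < (1 << KEY_BITS)
    return (_check(primary_key) << KEY_BITS) | primary_key

def verify(public: int) -> int | None:
    # Return the real key if the folded-in bits match; a mismatch should be
    # counted as a keyspace probe (e.g. 3 strikes -> semi-permanent lockout).
    key, presented = public & ((1 << KEY_BITS) - 1), public >> KEY_BITS
    return key if presented == _check(key) else None
```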
