散列和加密算法之间的根本区别 [英] Fundamental difference between Hashing and Encryption algorithms

查看:139
本文介绍了散列和加密算法之间的根本区别的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我看到散列和加密算法之间有很多混淆,我想听听更多专家的建议:

  1. 何时使用散列与加密

  2. 是什么让散列或加密算法与众不同(从理论/数学层面来看)即是什么使哈希不可逆转(没有彩虹树的帮助)

这里有一些类似的 SO 问题,但没有我想要的那么详细:

混淆、散列和加密有什么区别?
加密和散列的区别

解决方案

嗯,你可以在 中查找维基百科...但既然你想要一个解释,我会在这里尽力:

哈希函数

它们提供任意长度输入和(通常)固定长度(或更短长度)输出之间的映射.它可以是任何东西,从简单的 crc32 到完整的加密哈希函数,例如 MD5 或 SHA1/2/256/512.关键是正在进行单向映射.它总是一个多:1 映射(意味着总是会有冲突),因为每个函数产生的输出都比它能够输入的要小(如果你将每个可能的 1mb 文件输入 MD5,你会得到大量的冲突).

它们难以(或在实践中不可能)逆转的原因在于它们的内部工作方式.大多数加密哈希函数在输入集上迭代多次以产生输出.因此,如果我们查看每个固定长度的输入块(取决于算法),哈希函数将调用当前状态.然后它将迭代状态并将其更改为一个新状态并将其用作对自身的反馈(MD5 对每个 512 位数据块执行 64 次此操作).然后它以某种方式将所有这些迭代的结果状态重新组合在一起,形成结果哈希.

现在,如果您想解码散列,您首先需要弄清楚如何将给定的散列拆分为其迭代状态(对于小于数据块大小的输入有 1 种可能性,对于较大的输入有许多可能性).然后你需要反转每个状态的迭代.现在,要解释为什么这非常困难,请想象尝试从以下公式推导出 ab:10 = a + b.ab 有 10 种正组合可以工作.现在循环多次: tmp = a + b;a = b;b = tmp.对于 64 次迭代,您将有超过 10^64 种可能性可以尝试.这只是一个简单的添加,其中从迭代到迭代保留了一些状态.真正的哈希函数执行的操作远不止 1 次(MD5 对 4 个状态变量执行大约 15 次操作).而且由于下一次迭代取决于前一次的状态,而前一次在创建当前状态时被破坏,因此几乎不可能确定导致给定输出状态的输入状态(对于每次迭代都不少).结合所涉及的大量可能性,即使解码 MD5 也将占用近乎无限(但不是无限)数量的资源.如此多的资源,如果您了解输入的大小(对于较小的输入),那么暴力破解哈希实际上比尝试解码哈希要便宜得多.

加密函数

它们在任意长度的输入和输出之间提供 1:1 的映射.而且它们总是可逆的.需要注意的重要一点是,使用某种方法它是可逆的.对于给定的键,它始终是 1:1.现在,有多个输入:密钥对可能会生成相同的输出(实际上通常有,取决于加密函数).好的加密数据与随机噪声无法区分.这与始终具有一致格式的良好散列输出不同.

用例

当您想比较一个值但无法存储纯表示(出于多种原因)时,请使用哈希函数.密码应该非常适合这个用例,因为出于安全原因您不想将它们存储为纯文本(也不应该).但是,如果您想检查盗版音乐文件的文件系统怎么办?每个音乐文件存储 3 mb 是不切实际的.因此,取而代之的是获取文件的哈希值并存储它(md5 将存储 16 个字节而不是 3mb).这样,您只需对每个文件进行散列并与存储的散列数据库进行比较(由于重新编码、更改文件头等,这在实践中效果不佳,但这是一个示例用例).

在检查输入数据的有效性时使用哈希函数.这就是它们的设计目的.如果您有 2 个输入,并且想检查它们是否相同,请通过哈希函数运行它们.对于小输入大小(假设一个好的散列函数),碰撞的概率是天文数字.这就是为什么建议使用密码的原因.对于最多 32 个字符的密码,md5 有 4 倍的输出空间.SHA1 有 6 倍的输出空间(大约).SHA512 大约有 16 倍的输出空间.您并不真正关心密码是什么,您关心的是它是否与存储的密码相同.这就是为什么你应该使用哈希作为密码的原因.

每当您需要取回输入数据时,请使用加密.注意需要这个词.如果您要存储信用卡号,则需要在某个时候将它们取回,但不想以纯文本形式存储它们.因此,请存储加密版本并尽可能保证密钥的安全.

哈希函数也非常适合签署数据.例如,如果您使用 HMAC,您可以通过将数据与已知但未传输的值(秘密值)连接在一起的哈希值来签署数据.因此,您发送纯文本和 HMAC 哈希.然后,接收器简单地使用已知值散列提交的数据,并检查它是否与传输的 HMAC 匹配.如果相同,您就知道它没有被没有秘密值的一方篡改.这通常用于 HTTP 框架的安全 cookie 系统,以及通过 HTTP 传输数据的消息,您需要在某些情况下保证数据的完整性.

关于密码哈希的说明:

加密散列函数的一个关键特征是它们的创建速度应该非常快,并且非常很难/很慢地反转(以至于几乎不可能).这给密码带来了问题.如果你存储 sha512(password),你就没有做任何事情来防范彩虹表或蛮力攻击.请记住,哈希函数是为速度而设计的.因此,攻击者只需通过哈希函数运行字典并测试每个结果即可.

添加盐会有所帮助,因为它会向哈希添加一些未知数据.因此,与其找到与 md5(foo) 匹配的任何东西,他们需要找到一些东西,当添加到已知盐时会产生 md5(foo.salt)(这是非常更难做).但它仍然没有解决速度问题,因为如果他们知道盐,那只是运行字典的问题.

所以,有办法解决这个问题.一种流行的方法称为键强化(或键拉伸).基本上,您多次迭代哈希(通常为数千次).这有两件事.首先,它显着减慢了散列算法的运行时间.其次,如果正确实施(在每次迭代中将输入和盐传递回)实际上会增加输出的熵(可用空间),从而减少碰撞的机会.一个简单的实现是:

var hash = 密码 + 盐;for (var i = 0; i <5000; i++) {散列 = sha512(散列 + 密码 + 盐);}

还有其他更标准的实现,例如 PBKDF2BCrypt.但是这种技术被很多安全相关的系统(如 PGP、WPA、Apache 和 OpenSSL)使用.

归根结底,hash(password) 不够好.hash(password + salt) 更好,但仍然不够好......使用拉伸散列机制来生成你的密码散列......

关于平凡拉伸的另一个注意事项

在任何情况下都不要将一个哈希的输出直接反馈给哈希函数:

hash = sha512(password + salt);for (i = 0; i <1000; i++) {哈希= sha512(哈希);//<-- 不要这样做!}

这样做的原因与碰撞有关.请记住,所有哈希函数都有冲突,因为可能的输出空间(可能的输出数量)小于输入空间.要了解原因,让我们看看会发生什么.在此之前,让我们假设 sha1() 发生碰撞的可能性为 0.001%(实际上远低于,但出于演示目的).

hash1 = sha1(密码+盐);

现在,hash1 的碰撞概率为 0.001%.但是当我们做下一个hash2 = sha1(hash1);时,hash1的所有碰撞都会自动变成hash2的碰撞.所以现在,我们的 hash1 的比率为 0.001%,第二个 sha1() 调用增加了这一点.所以现在,hash2 的碰撞概率为 0.002%.那是两倍的机会!每次迭代都会为结果添加另一个 0.001% 碰撞机会.因此,经过 1000 次迭代,碰撞的几率从微不足道的 0.001% 跃升至 1%.现在,退化是线性的,实际概率更小,但效果是一样的(与md5发生单次碰撞的概率估计约为1/(2128) 或 1/(3x1038).虽然这看起来很小,但要感谢 生日攻击它并不像看起来那么小.

相反,通过每次重新附加盐和密码,您将数据重新引入哈希函数.因此,任何特定回合的任何碰撞都不再是下一轮的碰撞.所以:

hash = sha512(password + salt);for (i = 0; i <1000; i++) {散列 = sha512(散列 + 密码 + 盐);}

与原生 sha512 函数有相同的碰撞机会.这就是你想要的.改用那个.

I see a lot of confusion between hashes and encryption algorithms and I would like to hear some more expert advice about:

  1. When to use hashes vs encryptions

  2. What makes a hash or encryption algorithm different (from a theoretical/mathematical level) i.e. what makes hashes irreversible (without aid of a rainbow tree)

Here are some similar SO Questions that didn't go into as much detail as I was looking for:

What is the difference between Obfuscation, Hashing, and Encryption?
Difference between encryption and hashing

解决方案

Well, you could look it up in Wikipedia... But since you want an explanation, I'll do my best here:

Hash Functions

They provide a mapping between an arbitrary length input, and a (usually) fixed length (or smaller length) output. It can be anything from a simple crc32, to a full blown cryptographic hash function such as MD5 or SHA1/2/256/512. The point is that there's a one-way mapping going on. It's always a many:1 mapping (meaning there will always be collisions) since every function produces a smaller output than it's capable of inputting (If you feed every possible 1mb file into MD5, you'll get a ton of collisions).

The reason they are hard (or impossible in practicality) to reverse is because of how they work internally. Most cryptographic hash functions iterate over the input set many times to produce the output. So if we look at each fixed length chunk of input (which is algorithm dependent), the hash function will call that the current state. It will then iterate over the state and change it to a new one and use that as feedback into itself (MD5 does this 64 times for each 512bit chunk of data). It then somehow combines the resultant states from all these iterations back together to form the resultant hash.

Now, if you wanted to decode the hash, you'd first need to figure out how to split the given hash into its iterated states (1 possibility for inputs smaller than the size of a chunk of data, many for larger inputs). Then you'd need to reverse the iteration for each state. Now, to explain why this is VERY hard, imagine trying to deduce a and b from the following formula: 10 = a + b. There are 10 positive combinations of a and b that can work. Now loop over that a bunch of times: tmp = a + b; a = b; b = tmp. For 64 iterations, you'd have over 10^64 possibilities to try. And that's just a simple addition where some state is preserved from iteration to iteration. Real hash functions do a lot more than 1 operation (MD5 does about 15 operations on 4 state variables). And since the next iteration depends on the state of the previous and the previous is destroyed in creating the current state, it's all but impossible to determine the input state that led to a given output state (for each iteration no less). Combine that, with the large number of possibilities involved, and decoding even an MD5 will take a near infinite (but not infinite) amount of resources. So many resources that it's actually significantly cheaper to brute-force the hash if you have an idea of the size of the input (for smaller inputs) than it is to even try to decode the hash.

Encryption Functions

They provide a 1:1 mapping between an arbitrary length input and output. And they are always reversible. The important thing to note is that it's reversible using some method. And it's always 1:1 for a given key. Now, there are multiple input:key pairs that might generate the same output (in fact there usually are, depending on the encryption function). Good encrypted data is indistinguishable from random noise. This is different from a good hash output which is always of a consistent format.

Use Cases

Use a hash function when you want to compare a value but can't store the plain representation (for any number of reasons). Passwords should fit this use-case very well since you don't want to store them plain-text for security reasons (and shouldn't). But what if you wanted to check a filesystem for pirated music files? It would be impractical to store 3 mb per music file. So instead, take the hash of the file, and store that (md5 would store 16 bytes instead of 3mb). That way, you just hash each file and compare to the stored database of hashes (This doesn't work as well in practice because of re-encoding, changing file headers, etc, but it's an example use-case).

Use a hash function when you're checking validity of input data. That's what they are designed for. If you have 2 pieces of input, and want to check to see if they are the same, run both through a hash function. The probability of a collision is astronomically low for small input sizes (assuming a good hash function). That's why it's recommended for passwords. For passwords up to 32 characters, md5 has 4 times the output space. SHA1 has 6 times the output space (approximately). SHA512 has about 16 times the output space. You don't really care what the password was, you care if it's the same as the one that was stored. That's why you should use hashes for passwords.

Use encryption whenever you need to get the input data back out. Notice the word need. If you're storing credit card numbers, you need to get them back out at some point, but don't want to store them plain text. So instead, store the encrypted version and keep the key as safe as possible.

Hash functions are also great for signing data. For example, if you're using HMAC, you sign a piece of data by taking a hash of the data concatenated with a known but not transmitted value (a secret value). So, you send the plain-text and the HMAC hash. Then, the receiver simply hashes the submitted data with the known value and checks to see if it matches the transmitted HMAC. If it's the same, you know it wasn't tampered with by a party without the secret value. This is commonly used in secure cookie systems by HTTP frameworks, as well as in message transmission of data over HTTP where you want some assurance of integrity in the data.

A note on hashes for passwords:

A key feature of cryptographic hash functions is that they should be very fast to create, and very difficult/slow to reverse (so much so that it's practically impossible). This poses a problem with passwords. If you store sha512(password), you're not doing a thing to guard against rainbow tables or brute force attacks. Remember, the hash function was designed for speed. So it's trivial for an attacker to just run a dictionary through the hash function and test each result.

Adding a salt helps matters since it adds a bit of unknown data to the hash. So instead of finding anything that matches md5(foo), they need to find something that when added to the known salt produces md5(foo.salt) (which is very much harder to do). But it still doesn't solve the speed problem since if they know the salt it's just a matter of running the dictionary through.

So, there are ways of dealing with this. One popular method is called key strengthening (or key stretching). Basically, you iterate over a hash many times (thousands usually). This does two things. First, it slows down the runtime of the hashing algorithm significantly. Second, if implemented right (passing the input and salt back in on each iteration) actually increases the entropy (available space) for the output, reducing the chances of collisions. A trivial implementation is:

var hash = password + salt;
for (var i = 0; i < 5000; i++) {
    hash = sha512(hash + password + salt);
}

There are other, more standard implementations such as PBKDF2, BCrypt. But this technique is used by quite a few security related systems (such as PGP, WPA, Apache and OpenSSL).

The bottom line, hash(password) is not good enough. hash(password + salt) is better, but still not good enough... Use a stretched hash mechanism to produce your password hashes...

Another note on trivial stretching

Do not under any circumstances feed the output of one hash directly back into the hash function:

hash = sha512(password + salt); 
for (i = 0; i < 1000; i++) {
    hash = sha512(hash); // <-- Do NOT do this!
}

The reason for this has to do with collisions. Remember that all hash functions have collisions because the possible output space (the number of possible outputs) is smaller than then input space. To see why, let's look at what happens. To preface this, let's make the assumption that there's a 0.001% chance of collision from sha1() (it's much lower in reality, but for demonstration purposes).

hash1 = sha1(password + salt);

Now, hash1 has a probability of collision of 0.001%. But when we do the next hash2 = sha1(hash1);, all collisions of hash1 automatically become collisions of hash2. So now, we have hash1's rate at 0.001%, and the 2nd sha1() call adds to that. So now, hash2 has a probability of collision of 0.002%. That's twice as many chances! Each iteration will add another 0.001% chance of collision to the result. So, with 1000 iterations, the chance of collision jumped from a trivial 0.001% to 1%. Now, the degradation is linear, and the real probabilities are far smaller, but the effect is the same (an estimation of the chance of a single collision with md5 is about 1/(2128) or 1/(3x1038). While that seems small, thanks to the birthday attack it's not really as small as it seems).

Instead, by re-appending the salt and password each time, you're re-introducing data back into the hash function. So any collisions of any particular round are no longer collisions of the next round. So:

hash = sha512(password + salt);
for (i = 0; i < 1000; i++) {
    hash = sha512(hash + password + salt);
}

Has the same chance of collision as the native sha512 function. Which is what you want. Use that instead.

这篇关于散列和加密算法之间的根本区别的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆