Store SHA-1 in database in less space than the 40 hex digits

Question

I am using a hash algorithm to create a primary key for a database table. I use SHA-1, which is more than adequate for my purposes. The database even ships with an implementation of SHA-1. The function that computes the hash returns it as 40 hex characters, so I am storing those characters in a char(40) column.

The table will have a lot of rows (at least 200 million), which is why I am looking for a less space-hungry way of storing the hash. 40 characters times ~200 million rows comes to roughly 8 GB for this column alone, at one byte per character. Since hex is base 16, I thought I could store the value in base 256 instead, hoping to cut the length to around 20 characters. Do you have tips or papers on implementing compression with base 256?

Recommended answer

A SHA-1 value is 20 bytes. All the bits in these 20 bytes are significant, so there's no way to compress them. By storing the bytes in hexadecimal notation you're wasting half the space, because it takes exactly two hexadecimal digits to store one byte. So you can't compress the underlying value, but you can use a better encoding than hexadecimal.
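To make the size difference concrete, here is a small Python sketch. The input string and the 200 million row count (taken from the question) are only illustrative, and the estimate assumes one byte per stored character and ignores index and per-row overhead.

```python
import hashlib

# Raw digest: the 20 bytes that SHA-1 actually produces.
digest = hashlib.sha1(b"example row key").digest()

# Hex form: two characters per byte, so 40 characters.
hex_form = digest.hex()

raw_len = len(digest)
hex_len = len(hex_form)

# Rough saving for the column alone, assuming 1 byte per character
# and the row count mentioned in the question (an estimate only).
ROWS = 200_000_000
saved_gb = (hex_len - raw_len) * ROWS / 10**9

print(raw_len, hex_len, saved_gb)
```

Converting back with `bytes.fromhex(hex_form)` gives the original 20 bytes, so the hex form carries no extra information.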

Storing as a blob is the right answer. That's base 256: you store each byte as that byte, with no encoding layer to add overhead. Wasted space: 0.
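As a rough illustration of the blob approach, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name `items`, its columns, and the helper `sha1_key` are made up for the example; the question's actual database will have its own binary column type and its own way of converting existing hex strings.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the 20-byte digest is the primary key.
conn.execute("CREATE TABLE items (id BLOB PRIMARY KEY, payload TEXT)")

def sha1_key(data: bytes) -> bytes:
    """Return the raw 20-byte SHA-1 digest (not the hex string)."""
    return hashlib.sha1(data).digest()

# Python bytes objects are bound as BLOB values.
conn.execute(
    "INSERT INTO items (id, payload) VALUES (?, ?)",
    (sha1_key(b"hello"), "hello"),
)

# An existing 40-character hex key can be converted before lookup.
hex_key = hashlib.sha1(b"hello").hexdigest()
row = conn.execute(
    "SELECT payload, length(id) FROM items WHERE id = ?",
    (bytes.fromhex(hex_key),),
).fetchone()

print(row)
```

The lookup converts the hex string once on the client side, so the stored key stays at 20 bytes per row.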

If for some reason you can't do that and you need a printable string, you can still do better than hexadecimal by using a more compact encoding. With hexadecimal, the storage requirement is twice the minimum (assuming that each character is stored as one byte). Base64 brings the requirement down to 4 characters per 3 bytes, so you would need 28 characters to store the value. In fact, because the length is always 20 bytes (six full 3-byte groups plus 2 leftover bytes), the Base64 encoding always ends with exactly one =. You therefore only need to store 27 characters and restore the trailing = before decoding.
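A minimal sketch of the 27-character trick, using Python's standard `base64` module. The function names `encode27` and `decode27` are hypothetical helpers for this example, not part of any library.

```python
import base64
import hashlib

def encode27(digest: bytes) -> str:
    """Base64-encode a 20-byte digest and drop the single '=' pad."""
    if len(digest) != 20:
        raise ValueError("expected a 20-byte SHA-1 digest")
    encoded = base64.b64encode(digest).decode("ascii")  # 28 chars, ends in '='
    return encoded[:-1]

def decode27(text: str) -> bytes:
    """Restore the trailing '=' and decode back to 20 bytes."""
    return base64.b64decode(text + "=")

digest = hashlib.sha1(b"example row key").digest()
stored = encode27(digest)
print(len(stored), decode27(stored) == digest)
```

The length check matters: the fixed padding only holds for 20-byte inputs, so the helper rejects anything else instead of silently corrupting it.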

You could improve the encoding further by using more characters. Base64 uses 64 code points out of the available 256 byte values. ASCII (the de facto portable character set) has 95 printable characters (including space), but there's no common "base95" encoding, so you'd have to roll your own. Base85 is an intermediate choice that does get some use in practice. It encodes every 4 bytes as 5 characters, which lets you store the 20-byte value in 25 printable ASCII characters.
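For comparison, a short sketch with Python's `base64.b85encode`, which is one of several Base85 variants (each uses a different alphabet). Which variant fits depends on what characters your column and tooling tolerate, so treat the choice here as an assumption.

```python
import base64
import hashlib

digest = hashlib.sha1(b"example row key").digest()

# 20 bytes = five 4-byte groups, each encoded as 5 characters.
b85_text = base64.b85encode(digest).decode("ascii")
restored = base64.b85decode(b85_text)

print(len(b85_text), restored == digest)
```

Because 20 is a multiple of 4, every digest encodes to the same length with no padding to track.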
