Most memory efficient way to store 8M+ sha256 hashes


Problem description

I have been using a dict to store key-value pairs where the key and value are both sha256 hash digests. I need to be able to find out whether a key exists in the dict, and also to retrieve the value for that key.

At the moment I estimate, based on some of my tests, that I will need about 10 GB of memory to store 8,000,000 hashes, whereas the actual data being stored is only about 512 MB (32 bytes per hash, so 64 bytes per record).

Does anyone have any suggestions?

UPDATE: based on some of the comments I thought I should clarify. I am storing the hashes as bytes, not as hex strings. I am using an sqlite database for permanent storage of the data, but inserting that many records with an index gets too slow after about 1,000,000 records, and without the index, checking for the existence of a key also gets slower and slower as the table grows. That's why I want to use an in-memory structure to do the lookups.

UPDATE 2

Could this work? atbr hashtable

My solution (should I put this as an answer?): What I ended up doing, taking a lot of advice from @abarnert, was to create a new class that holds 1024 lists of [count, bytearray(8000 * 32), bytearray(8000 * 32)].

I use the first 10 bits of the hash as an index into the list in which I should store that hash. Then I just append the key to the next free 32-byte slot, and the value to the same slot in the other bytearray.

I can generate 16,000,000 hashes (one for the key and one for the value) and insert the 8,000,000 key-value pairs into the structure in about 30 seconds.

Searching is just the reverse: I use the first 10 bits to find the list, and then I do a linear search over the hashes until I find the one I want.

Searching for 200,000 hashes picked at random from the 8,000,000 takes 30 seconds, so 40 times slower than writing, but it should be fast enough for my needs.

The bonus of it all: it now only consumes 519 MB of RAM for all 8,000,000 hashes.
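
For reference, here is a minimal sketch of that structure in Python. The class and method names, the fixed 8,000-slot capacity per bucket, and the lack of any bucket-growing logic are my own simplifications for illustration, not part of the original code.

    import hashlib

    class HashStore:
        BUCKETS = 1024          # addressed by the first 10 bits of the key
        CAPACITY = 8000         # 32-byte slots pre-allocated per bucket

        def __init__(self):
            # Each bucket is [used_slots, packed_keys, packed_values].
            self.buckets = [[0, bytearray(self.CAPACITY * 32), bytearray(self.CAPACITY * 32)]
                            for _ in range(self.BUCKETS)]

        def _bucket(self, key):
            # The first 10 bits of the 32-byte digest select the bucket.
            return self.buckets[(key[0] << 2) | (key[1] >> 6)]

        def put(self, key, value):
            bucket = self._bucket(key)
            count, keys, values = bucket
            if count >= self.CAPACITY:
                raise MemoryError("bucket full; a real implementation would grow it")
            keys[count * 32:(count + 1) * 32] = key
            values[count * 32:(count + 1) * 32] = value
            bucket[0] = count + 1

        def get(self, key):
            count, keys, values = self._bucket(key)
            for i in range(count):                  # linear scan within the bucket
                if keys[i * 32:(i + 1) * 32] == key:
                    return bytes(values[i * 32:(i + 1) * 32])
            return None

    # Store one pair and read it back.
    store = HashStore()
    k = hashlib.sha256(b"key").digest()
    v = hashlib.sha256(b"value").digest()
    store.put(k, v)
    assert store.get(k) == v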

Thank you everyone for your help.

Solution

First, let's look at why this is so big.

Each hash is 32 bytes. That means it takes about 32 bytes to store in binary form in, e.g., the storage of a bytes or bytearray object. So far, so good.

But all Python objects have headers, typically on the order of 24-64 bytes. From a quick check, it looks like bytes objects take up an extra 36 bytes on 32 bits (possibly plus alignment padding), and 48 on 64 bits, at least on the two versions of CPython I checked.
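
For example, you can check the overhead yourself with sys.getsizeof (the numbers below are what a 64-bit CPython typically reports and will vary by build):

    import hashlib
    import sys

    digest = hashlib.sha256(b"example").digest()   # 32 bytes of payload
    print(len(digest))              # 32
    print(sys.getsizeof(digest))    # ~65 on 64-bit CPython: header + payload + trailing NUL
    print(sys.getsizeof(b""))       # ~33: the fixed per-object header of an empty bytes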

So, how can you get rid of that 150% extra storage? Pack the bytes into one giant array, like a bytes or bytearray. Then you've got 48 bytes total plus 32 per hash, instead of 48+32 per hash. When you need to access a hash, if you've got the index, it's just the slice [index*32:(index+1)*32].
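
A minimal sketch of that packing idea (the names packed, add_hash and get_hash are mine, just for illustration):

    import hashlib

    HASH_SIZE = 32
    packed = bytearray()

    def add_hash(digest):
        # Append a 32-byte digest and return its index.
        index = len(packed) // HASH_SIZE
        packed.extend(digest)
        return index

    def get_hash(index):
        # Return the digest stored at `index` as bytes.
        return bytes(packed[index * HASH_SIZE:(index + 1) * HASH_SIZE])

    i = add_hash(hashlib.sha256(b"hello").digest())
    assert get_hash(i) == hashlib.sha256(b"hello").digest()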

Also, depending on how you created the bytes, there can be some overflow slop. You can check that—if sys.getsizeof(s) - sys.getsizeof(b'') > len(s), you need to slice all of your objects to create new copies with no extra padding.

Anyway, now you have 8M extra indexes. If those are transient, that's fine, but if you're storing them as ints in dict value slots, then each one of them also has a header. From a quick test, on top of the 4 bytes for actual storage (for an int under 1<<31), there's a 24-byte header in both 32- and 64-bit (although very small ints can apparently be crammed into the header). So, all this does is cut your 48 bytes of waste to 28 bytes, which isn't great.
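
Again, sys.getsizeof makes the per-int overhead easy to see (figures from a 64-bit CPython; exact values vary by version):

    import sys
    print(sys.getsizeof(123456789))   # 28: ~4 bytes of payload plus a 24-byte object header
    print(sys.getsizeof(5))           # small ints are cached singletons, but still 28 bytes each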

You could use some form of packed storage, like the array module. An array type I only uses 4 bytes per integer. But then you need indices into the array, which… is the same problem you just solved.

But you really don't even need the indices—if you store the keys themselves in an array, the index of any key is already the index of the hash in the byte string (divided by 32), right?

This only works if you can store the keys in some sort of compact array. If they're all around the same size, you can do this by using the same "giant byte string" trick again. And in your case, they are: the keys are also 32-byte hashes. So you just have to keep both giant byte strings sorted by the key values (see the bisect module so you don't have to write that code yourself).
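
A sketch of that sorted, packed layout, using the bisect module as suggested. The sequence wrapper and class names are my own; note that each insert splices into the middle of a bytearray, so for 8M entries you would bulk-load and sort once rather than insert one pair at a time:

    import bisect
    import hashlib

    HASH_SIZE = 32

    class PackedKeys:
        # Read-only sequence of 32-byte slices, so bisect can search it.
        def __init__(self, buf):
            self.buf = buf
        def __len__(self):
            return len(self.buf) // HASH_SIZE
        def __getitem__(self, i):
            return bytes(self.buf[i * HASH_SIZE:(i + 1) * HASH_SIZE])

    class SortedHashMap:
        def __init__(self):
            self.keys = bytearray()
            self.values = bytearray()

        def insert(self, key, value):
            i = bisect.bisect_left(PackedKeys(self.keys), key)
            self.keys[i * HASH_SIZE:i * HASH_SIZE] = key      # splice key into sorted position
            self.values[i * HASH_SIZE:i * HASH_SIZE] = value  # keep values aligned with keys

        def get(self, key):
            view = PackedKeys(self.keys)
            i = bisect.bisect_left(view, key)
            if i < len(view) and view[i] == key:
                return bytes(self.values[i * HASH_SIZE:(i + 1) * HASH_SIZE])
            return None

    m = SortedHashMap()
    k = hashlib.sha256(b"k").digest()
    v = hashlib.sha256(b"v").digest()
    m.insert(k, v)
    assert m.get(k) == v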

Of course using binary search algorithms instead of hashing means that you're making the lookups and inserts logarithmic instead of constant. And, while log(8M) is only around 16, which is a lot better than 8M, it's still 16x as bad as 1. But this is effectively what you get from an ideally-tuned relational database, except that you don't need to do any tuning, and it's all in memory, and there's no extra overhead, so it has to be an improvement over what you've tried so far.

You could of course build a custom hash table in Python, using the two giant byte arrays as your storage, and two array('I') as your indices. But that's a lot more work, so I'd try it the easy way first.
