需要内存高效的方式来存储吨字符串(是:在Java中的HAT-Trie实现) [英] Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

查看:233
本文介绍了需要内存高效的方式来存储吨字符串(是:在Java中的HAT-Trie实现)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理大量的(5-20​​万)字符串键(平均长度为10个字符),我需要将它们存储在内存数据结构中在常量时间或接近常量的时间内支持以下操作:

  //如果输入存在于容器中,则返回true,否则返回false 
public boolean contains(String input)

Java的HashMap被证明不仅仅是就吞吐量而言令人满意,但占用大量内存。我正在寻找一种能够提高内存效率的解决方案,并且仍然支持相当的吞吐量(与散列相当或几乎与散列相当)。



我不在乎插入/删除时间。在我的应用程序中,我将仅执行插入操作(仅在启动时),并且随后只会在应用程序的整个生命周期中使用 contains 方法查询数据结构。 / p>

我读到HAT-Trie数据结构与我的需求最接近。我想知道是否有一个库有一个实现。



其他建议指向实现的实现欢迎。



感谢您。

解决方案

对于您的约束,trie似乎是一个非常好的主意。 b $ b

一个在盒子外面思考的选择:

如果您可以承担某些不存在的字符串存在的概率



编辑:如果您可以负担误报,请使用布隆过滤器,如注释中的WizardOfOdds所示。



对于k = 1,布隆过滤器就像一个没有密钥的哈希表:每个桶一个布尔值,用于告知是否存在至少一个具有相同散列的输入。如果1%的误报率是可以接受的,那么你的散列表可以小到大约100×20百万比特或大约200兆比特。对于千分之一的误报,2GiB。

使用多个散列函数而不是一个散列函数可以提高相同位数的误报率。


I am working with a large set (5-20 million) of String keys (average length 10 chars) which I need to store in an in memory data structure that supports the following operation in constant time or near constant time:

// Returns true if the input is present in the container, false otherwise
public boolean contains(String input)

Java's Hashmap is proving to be more than satisfactory as far as throughput is concerned but is taking up a lot of memory. I am looking for a solution that is memory efficient and still supports a throughput that is decent (comparable with or nearly as good as hashing).

I don't care about the insertion/deletion times. In my application, I will be performing only insertions (only at startup time) and will subsequently only be querying the data structure using the contains method for the life of the application.

I read that the HAT-Trie data structure is closest for my needs. I am wondering if there is a library that has an implementation.

Other suggestions with pointers to implementations welcome.

Thank You.

解决方案

The trie seems like a very good idea for your constraints.

A "thinking outside the box" alternative:

If you can afford some probability of answering "present" for a string that is absent

EDIT: if you can afford false positives, use a Bloom filter as suggested by WizardOfOdds in the comments.

For k=1, a Bloom filter is like a hash table without the keys: each "bucket" is simply a boolean that tells if at least one input with the same hash was present. If 1% false positives is acceptable, your hash table can be as small as about 100 * 20 million bits or roughly 200 MiB. For 1 in 1000 false positives, 2GiB.

Using several hash functions instead of one can improve the false positive rate for the same amount of bits.

这篇关于需要内存高效的方式来存储吨字符串(是:在Java中的HAT-Trie实现)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆