Choose Trie or HashMap for storing a word frequency list?


Question

I have a txt file containing 1 million English words with their frequencies, in this format:

good 345667
bad 456777
...

I need to store it using either a HashMap or a Trie data structure in Java. Later on I need to look up words from the list, with no other operations. My understanding is that lookup is slower for a HashMap than for a Trie, but that a Trie uses more memory; implementing a Trie also takes effort, while HashMap is ready to use. For production code, which data structure best suits this situation? Thanks in advance.

Also, HashMap offers "constant time" lookup. Is it really slower than a Trie for English words?

Answer

"My understanding is that lookup is slower for a HashMap than for a Trie, but that a Trie uses more memory"

This is incorrect. Assuming a good hash function, a lookup in a HashMap requires a small constant number of random accesses to main memory, irrespective of the size of the table or the length of its keys. A trie, in contrast, requires an access to main memory for each letter in the key. A trie will therefore cause more cache misses, and cache misses dominate the overall lookup cost on modern hardware.
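To make the access pattern concrete, here is a minimal sketch of a trie lookup. The class name TrieLookupSketch and its layout are hypothetical, written only for illustration; keeping the children in a per-node HashMap is an assumption made for brevity, and real tries often use arrays or more compact encodings instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie, written only to illustrate the access pattern.
final class TrieLookupSketch {
    static final class Node {
        // Assumption: children are kept in a small map keyed by character.
        final Map<Character, Node> children = new HashMap<>();
        Integer frequency; // null unless a word ends at this node
    }

    final Node root = new Node();

    void put(String word, int frequency) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.frequency = frequency;
    }

    // One node dereference per character of the key: every hop follows a
    // pointer to a separately allocated object, which may miss the cache.
    Integer get(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return null; // no stored word has this prefix
            }
        }
        return node.frequency;
    }
}
```

The loop in get runs once per letter, so a 7-letter word follows at least 7 pointers, whereas a HashMap lookup typically goes to one bucket and then compares the key.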

A trie can save memory if the keys are long and share many common prefixes.

A trie also supports prefix queries.
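As a rough illustration of what a prefix query looks like, the sketch below walks to the node for the prefix and then collects every word underneath it. The class PrefixTrieSketch and the method wordsWithPrefix are hypothetical names; using a TreeMap for the children is an assumption chosen so that results come back in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical trie that supports prefix queries; a sketch, not a tuned implementation.
final class PrefixTrieSketch {
    private static final class Node {
        // Assumption: TreeMap keeps children sorted by character.
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    // Returns all stored words that start with the given prefix, in sorted order.
    List<String> wordsWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < prefix.length(); i++) {
            node = node.children.get(prefix.charAt(i));
            if (node == null) {
                return out; // nothing stored under this prefix
            }
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk of the subtree below the prefix node.
    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack
        }
    }
}
```

With a plain HashMap the same query would need a scan over every key, which is why prefix search is the usual reason to accept a trie's extra cost.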

In your case, keys are short, and you don't need prefix queries, so you won't benefit from a trie.
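For the HashMap route, a minimal loading sketch could look like the following. It assumes each line holds a word and a count separated by whitespace, as in the question's sample; the class name WordFrequencyLoader, the expectedEntries parameter, the choice of Long for counts, and the decision to skip malformed lines are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader for a "word count" file; a sketch under the stated assumptions.
final class WordFrequencyLoader {
    static Map<String, Long> load(Reader source, int expectedEntries) {
        // Pre-size the table so it does not have to rehash while loading
        // (HashMap's default load factor is 0.75).
        int capacity = (int) (expectedEntries / 0.75f) + 1;
        Map<String, Long> frequencies = new HashMap<>(capacity);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split("\\s+");
                if (parts.length != 2) {
                    continue; // assumption: silently skip malformed lines
                }
                // Long.parseLong throws NumberFormatException on a non-numeric count.
                frequencies.put(parts[0], Long.parseLong(parts[1]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return frequencies;
    }
}
```

A caller might pass Files.newBufferedReader(Paths.get("words.txt")) with expectedEntries set to 1000000, where words.txt is a placeholder file name, and then look up a word with frequencies.get("good") or frequencies.getOrDefault(word, 0L).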
