Java HashMap performance optimization / alternative


Problem Description

I want to create a large HashMap but the put() performance is not good enough. Any ideas?

Other data structure suggestions are welcome but I need the lookup feature of a Java Map:

map.get(key)

In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using different hash code distributions for the keys could help?

My hashCode() method:

byte[] a = new byte[2]; // two distinct values in the range 0-51
byte[] b = new byte[3]; // three distinct values in the range 0-51
...

public int hashCode() {
    int hash = 503;
    hash = hash * 5381 + (a[0] + a[1]);        // order-insensitive sum of a
    hash = hash * 5381 + (b[0] + b[1] + b[2]); // order-insensitive sum of b
    return hash;
}

I am using the commutative (order-independent) property of addition to ensure that equal objects have the same hash code. The arrays hold byte values in the range 0-51, and a value is only used once within either array. The objects are equal if the a arrays contain the same values (in either order), and the same goes for the b arrays. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.
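For context, a minimal sketch of an equals() consistent with that description follows. The class name Key and the constructor (which stores sorted copies) are illustrative assumptions; the question only describes the equality rule.

import java.util.Arrays;

public final class Key {
    private final byte[] a; // two distinct values in 0-51
    private final byte[] b; // three distinct values in 0-51

    public Key(byte[] a, byte[] b) {
        // store sorted copies so that element order does not affect equality
        this.a = a.clone();
        this.b = b.clone();
        Arrays.sort(this.a);
        Arrays.sort(this.b);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key other = (Key) o;
        // arrays are stored sorted, so {0,1} equals {1,0}, etc.
        return Arrays.equals(a, other.a) && Arrays.equals(b, other.b);
    }

    // hashCode() as in the question above (or the improved version in the answer)
}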

Edit, some notes:

  • A few people have criticized using a hash map or other data structure to store 26 million entries. I cannot see why this would seem strange. It looks like a classic data structures and algorithms problem to me. I have 26 million items and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.

  • Setting the initial capacity of the default Java HashMap to 26 million decreases the performance (see the sketch after this list).

  • Some people have suggested using databases; in some other situations that is definitely the smart option. But I am really asking a data structures and algorithms question. A full database would be overkill and much slower than a good data structure solution (after all, the database is just software, but it would add communication and possibly disk overhead).
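On the initial-capacity note above: HashMap rehashes once its size exceeds capacity × load factor, so holding n entries without any resize needs an initial capacity of at least n / loadFactor. A minimal sketch with illustrative figures (this is not the poster's code):

import java.util.HashMap;
import java.util.Map;

public class Presized {
    public static void main(String[] args) {
        int expectedEntries = 26_000_000;
        float loadFactor = 0.75f; // HashMap's default load factor
        // add 1 so that expectedEntries stays strictly below the resize threshold
        int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
        Map<Long, Long> map = new HashMap<>(initialCapacity, loadFactor);
        System.out.println("map pre-sized for " + expectedEntries + " entries");
    }
}

Even a correctly pre-sized map is dominated by a poor hashCode(), which is what the answer below shows.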

Answer

As many people pointed out, the hashCode() method was to blame. It was generating only around 20,000 distinct codes for 26 million objects, an average of 1,300 objects per hash bucket, which is very, very bad. However, if I turn the two arrays into a number in base 52, I am guaranteed to get a unique hash code for every object:

public int hashCode() {
    // assume that both a and b are sorted
    // treat the five values (each in 0-51) as the digits of a base-52 number
    return a[0] + powerOf52(a[1], 1)
         + powerOf52(b[0], 2) + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}
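Assuming a key class that combines the earlier equals() sketch with this hashCode() (that combination is an assumption, not code from the answer), a quick usage check might look like:

import java.util.HashMap;
import java.util.Map;

public class Demo {
    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        // per the question, these two keys are equal despite the ordering
        map.put(new Key(new byte[]{0, 1}, new byte[]{45, 12, 33}), "value");
        String hit = map.get(new Key(new byte[]{1, 0}, new byte[]{33, 45, 12}));
        System.out.println(hit); // prints "value": equal keys, same hash code
    }
}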

The arrays are sorted to ensure that this method fulfills the hashCode() contract: equal objects have the same hash code. Since each of the five values is a digit in the range 0-51, distinct sorted arrays produce distinct base-52 numbers, and the largest possible code (roughly 3.8 × 10^8) still fits in an int. Using the old method, the average number of puts per second over blocks of 100,000 puts, from 100,000 to 2,000,000 insertions, was:

168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083

Using the new method gives:

337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25

Much, much better. The old method tailed off very quickly, while the new one keeps up a good throughput.
