Java HashMap performance optimization / alternative


Question

I want to create a large HashMap but the put() performance is not good enough. Any ideas?

Other data structure suggestions are welcome but I need the lookup feature of a Java Map:

map.get(key)

In my case I want to create a map with 26 million entries. Using the standard Java HashMap the put rate becomes unbearably slow after 2-3 million insertions.

Also, does anyone know if using different hash code distributions for the keys could help?

My hashCode() method:

// a holds 2 values and b holds 3, each in the range 0-51 (initialization elided)
byte[] a = new byte[2];
byte[] b = new byte[3];
...

public int hashCode() {
    int hash = 503;
    // addition is order-insensitive, so equal objects produce the same hash
    hash = hash * 5381 + (a[0] + a[1]);
    hash = hash * 5381 + (b[0] + b[1] + b[2]);
    return hash;
}

I am using the commutative (order-insensitive) property of addition to ensure that equal objects have the same hash code. The arrays hold byte values in the range 0-51, and a value is only used once in either array. The objects are equal if the a arrays contain the same values (in either order) and the same goes for the b arrays. So a = {0,1} b = {45,12,33} and a = {1,0} b = {33,45,12} are equal.
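
For reference, the post never shows the matching equals(); a minimal sketch consistent with those semantics might look like this (the class name Key is made up, and sorted copies of the arrays are compared):

// Hypothetical equals() matching the described semantics: two keys are
// equal if their a arrays hold the same values in any order, and
// likewise for their b arrays.
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Key)) return false;
    Key other = (Key) o;
    byte[] a1 = this.a.clone(), a2 = other.a.clone();
    byte[] b1 = this.b.clone(), b2 = other.b.clone();
    java.util.Arrays.sort(a1); java.util.Arrays.sort(a2);
    java.util.Arrays.sort(b1); java.util.Arrays.sort(b2);
    return java.util.Arrays.equals(a1, a2) && java.util.Arrays.equals(b1, b2);
}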

Edit, some notes:


  • A few people have criticized using a hash map or other data structure to store 26 million entries. I cannot see why this would seem strange. It looks like a classic data structures and algorithms problem to me. I have 26 million items and I want to be able to quickly insert them into and look them up from a data structure: give me the data structure and algorithms.

  • Setting the initial capacity of the default Java HashMap to 26 million decreases the performance (see the sizing sketch after this list).

  • Some people have suggested using databases; in some other situations that is definitely the smart option. But I am really asking a data structures and algorithms question: a full database would be overkill and much slower than a good data structure solution (after all, the database is just software, but it would add communication and possibly disk overhead).
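
On the initial-capacity point above: HashMap rehashes once its size exceeds capacity × load factor, so pre-sizing for n entries normally means passing about n / 0.75 rather than n itself. A minimal sketch, assuming the default load factor (the helper name presizedMap is made up):

import java.util.HashMap;
import java.util.Map;

public class Presize {
    // Hypothetical helper: pick an initial capacity large enough that
    // inserting expectedEntries never triggers a rehash at the default
    // load factor of 0.75.
    static <K, V> Map<K, V> presizedMap(int expectedEntries) {
        int capacity = (int) (expectedEntries / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    public static void main(String[] args) {
        Map<Integer, String> map = presizedMap(26_000_000);
        map.put(1, "example");
    }
}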

Answer

As many people pointed out, the hashCode() method was to blame. It was only generating around 20,000 codes for 26 million distinct objects, an average of 1,300 objects per hash bucket, which is very, very bad. However, if I turn the two arrays into a number in base 52, I am guaranteed a unique hash code for every object:

public int hashCode() {
    // assume that both a and b are sorted; each element then becomes a
    // distinct base-52 digit, so distinct keys get distinct hash codes
    return a[0] + powerOf52(a[1], 1) + powerOf52(b[0], 2)
         + powerOf52(b[1], 3) + powerOf52(b[2], 4);
}

// returns b * 52^power; fits in an int for values 0-51 and power <= 4
public static int powerOf52(byte b, int power) {
    int result = b;
    for (int i = 0; i < power; i++) {
        result *= 52;
    }
    return result;
}
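
The "assume sorted" comment matters: the post does not show where the sorting happens, but one way to guarantee it is to sort once at construction time, which keeps hashCode() and equals() consistent for all later lookups. A sketch, again using the made-up Key class name:

import java.util.Arrays;

// Hypothetical key class: cloning and sorting in the constructor means
// every later hashCode() and equals() call sees sorted arrays.
public final class Key {
    private final byte[] a;
    private final byte[] b;

    public Key(byte[] a, byte[] b) {
        this.a = a.clone();
        this.b = b.clone();
        Arrays.sort(this.a);
        Arrays.sort(this.b);
    }
}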

The arrays are sorted to ensure this method fulfills the hashCode() contract that equal objects have the same hash code. Using the old method, the average number of puts per second, measured over blocks of 100,000 puts from insertion 100,000 to 2,000,000, was:

168350.17
109409.195
81344.91
64319.023
53780.79
45931.258
39680.29
34972.676
31354.514
28343.062
25562.371
23850.695
22299.22
20998.006
19797.799
18702.951
17702.434
16832.182
16084.52
15353.083

Using the new method gives:

337837.84
337268.12
337078.66
336983.97
313873.2
317460.3
317748.5
320000.0
309704.06
310752.03
312944.5
265780.75
275540.5
264350.44
273522.97
270910.94
279008.7
276285.5
283455.16
289603.25

Much, much better. The old method tailed off very quickly, while the new one keeps up a good throughput.
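
For anyone wanting to reproduce the measurement, here is a minimal sketch of a put-rate loop over 100,000-put blocks; the Integer keys are a stand-in, since the original harness and key generation are not shown:

import java.util.HashMap;
import java.util.Map;

public class PutRateBenchmark {
    static final int TOTAL = 2_000_000;   // entries to insert
    static final int BLOCK = 100_000;     // report the rate per block

    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();
        long blockStart = System.nanoTime();
        for (int i = 1; i <= TOTAL; i++) {
            map.put(i, i);  // stand-in for the real Key objects
            if (i % BLOCK == 0) {
                long now = System.nanoTime();
                double putsPerSecond = BLOCK / ((now - blockStart) / 1e9);
                System.out.printf("%,d entries: %.2f puts/s%n", i, putsPerSecond);
                blockStart = now;
            }
        }
    }
}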

