Hashcode bucket distribution in Java


Problem Description

Suppose I need to store 1000 objects in a HashSet. Is it better to have 1000 buckets, each containing one object (by generating a unique hash code value for each object), or 10 buckets each containing roughly 100 objects?

One advantage of having unique buckets is that I can save execution cycles on calls to the equals() method?

Why is it important to have a set number of buckets and distribute the objects among them as evenly as possible?

What should be the ideal object-to-bucket ratio?

Solution

Why is it important to have a set number of buckets and distribute the objects among them as evenly as possible?

A HashSet should be able to determine membership in O(1) time on average. From the documentation:

This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.

The algorithm a HashSet uses to achieve this is to retrieve the object's hash code and use it to find the correct bucket. Then it iterates over the items in that bucket until it finds one that is equal. If the number of items in the bucket grows beyond a constant, lookup will take longer than O(1) time.
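
As a rough sketch of that lookup path (a simplified model for illustration only, not the actual java.util.HashSet implementation, which is backed by a HashMap and has further optimizations):

    import java.util.List;

    class BucketLookupSketch {
        // Simplified model: hash the element, pick a bucket by index,
        // then compare with equals() against everything already in that bucket.
        static <T> boolean contains(List<T>[] buckets, T target) {
            int index = Math.floorMod(target.hashCode(), buckets.length); // bucket index
            List<T> bucket = buckets[index];
            if (bucket == null) {
                return false;               // empty bucket: nothing to compare
            }
            for (T item : bucket) {         // cost is proportional to the bucket's size
                if (item.equals(target)) {
                    return true;
                }
            }
            return false;
        }
    }

The fewer elements share a bucket, the fewer equals() calls each lookup needs, which is exactly why spreading the objects evenly matters.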

In the worst case - if all items hash to the same bucket - it will take O(n) time to determine if an object is in the set.
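
For example, a deliberately bad hashCode() forces every instance into the same bucket (the class below is hypothetical, just to illustrate the degenerate case):

    // Hypothetical key class: equals() is correct, but hashCode() is a constant,
    // which is legal yet sends every instance to the same bucket.
    class BadKey {
        private final int id;

        BadKey(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 42; // every BadKey collides; lookups degrade toward O(n)
        }
    }

With 1000 BadKey instances in a HashSet, a contains() call ends up comparing against one long chain instead of a single short bucket, which is the O(n) worst case described above.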

What should be the ideal object-to-bucket ratio?

There is a space-time tradeoff here. Increasing the number of buckets decreases the chance of collisions, but it also increases memory requirements. The HashSet has two parameters, initialCapacity and loadFactor, that allow you to adjust how many buckets the HashSet should create. The default load factor is 0.75, which is fine for most purposes, but if you have special requirements you can choose another value.
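
For example, both parameters can be passed straight to the constructor (the values below are only illustrative, and java.util.HashSet/Set are assumed to be imported):

    // new HashSet<>(initialCapacity, loadFactor): request ~2048 buckets up front
    // instead of the default 16, keeping the default 0.75 load factor.
    Set<String> set = new HashSet<>(2048, 0.75f);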

More information about these parameters can be found in the documentation for HashMap:

This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the capacity is roughly doubled by calling the rehash method.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
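
Applying that last rule to the 1000-object case from the question gives a quick sizing sketch (assuming the default 0.75 load factor; the class and variable names are only illustrative):

    import java.util.HashSet;
    import java.util.Set;

    class PresizedSet {
        public static void main(String[] args) {
            int expectedEntries = 1000;
            float loadFactor = 0.75f;

            // The capacity must exceed entries / loadFactor, i.e. 1000 / 0.75, so at least 1334;
            // the implementation may round this up internally to a power of two.
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);

            Set<Integer> objects = new HashSet<>(initialCapacity, loadFactor);
            for (int i = 0; i < expectedEntries; i++) {
                objects.add(i);             // no rehash is triggered while filling the set
            }
            System.out.println(objects.size()); // 1000
        }
    }

Whether the extra up-front memory is worth avoiding those rehashes is the space-time tradeoff described above.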
