Calculating HashMap overhead in Java


Question

Let's say I'm storing 1000 objects in a hashmap. This hashmap is extended to allow me to map three dimensional coordinates to the objects stored in it; the objects inside have a fixed size. The hash key is a long integer.

How would I go about figuring out (mathematically) the probable overhead for this structure?

  1. Is it significant enough that, for instance, if the data inside is around 256mb that the overhead will matter?
  2. Is there a reliable way (Aside from a profiler, which I've found are unreliable in some cases) to mathematically calculate what its overhead should be?

I'm not interested in the total size of the hashmap - only the overhead that using the hashmap will incur. For instance, if I have 10 ints, they're 4 bytes apiece, so that's 40 bytes. If I stick them in an array, I get a constant overhead of 12 bytes - 8 for the object header, 4 for the length. If I put them in another structure (a TreeSet, for instance), my overhead will not be constant, because a tree needs nodes - so I might get an overhead expressed in terms of n, where n is the number of items in the set.

A few things are obvious to me, which I'll give as my starting point here.

  1. I will need to store at least 1000 longs. These are nullable types, so they're actually objects. I will assume therefore that the 8 byte long integer being used has an object header also of 8 bytes. I will add a factor of 16n.
  2. I will need references to every object as well, which must exist whether or not the object has been recalled from the map and is being used; so that's an additional 8 bytes per object. We could factor it into the data size instead, but since the references are in the hashmap itself, I feel like it's best to make them part of the overhead. My logic is as follows: If I took all of the data out of the hashmap and stored it in variables, those n references would still exist in the hashmap, provided I didn't remove these data objects, which I won't be doing. The set of objects is constant, though they may be recycled with a different key.
  3. The hashmap itself has an overhead of 8 bytes.
  4. The hashmap must store the number of items inside (or so I think!) so that's 4 bytes.
  5. I will suppose ignorantly that the hash keys are in an array, sorted by hash key order. That's 12 bytes for the array.
  6. I will assume ignorantly as well that the objects are in a matching array, which it dereferences when it finds the key. I will guess another 12 bytes.

This gives me a simple formula for the overhead: 36 + 24n bytes.

Thus I have a guess of 24036 bytes of overhead for 1000 data objects using long keys. That is a fairly insignificant overhead, but my question for you is, what is the real overhead, just sitting there?


A secondary, equally valid question is, how much does this vary from JVM to JVM? Is there any JVM-independent way to figure it out? To exemplify what I mean, consider a JVM that only has 32-bit object headers: even though the exact size varies from JVM to JVM, it's a fair estimate that the overhead on an array would become 8 bytes instead of 12 in that case.

I'm assuming a fixed implementation of HashMap across the same version of Java.


I could try to read the source code or run profiling; however, that may produce misleading results specific to my JVM. I'm asking for your help - perhaps someone who knows can supply some piece of information that we both don't already have about the situation. Thanks!
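A rough empirical check is possible without a full profiler, with the usual caveat that the number it prints is specific to the JVM, heap state and contents it runs against. The sketch below is only an illustration under those caveats, not an authoritative measurement: it snapshots the used heap, fills a HashMap&lt;Long, Object&gt; with values that were allocated beforehand, and snapshots again, so the difference approximates the map's bookkeeping plus the boxed keys.

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapOverheadProbe {

        // Best-effort used-heap snapshot; System.gc() is only a hint to the JVM.
        static long usedHeap() throws InterruptedException {
            System.gc();
            Thread.sleep(200);
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        public static void main(String[] args) throws InterruptedException {
            final int n = 1000;
            Object[] values = new Object[n];
            for (int i = 0; i < n; i++) {
                values[i] = new Object();        // stand-in for the fixed-size data object
            }

            long before = usedHeap();
            Map<Long, Object> map = new HashMap<>();
            for (long key = 0; key < n; key++) {
                map.put(key, values[(int) key]); // autoboxing allocates a Long per key
            }                                    // (keys below 128 come from the Long cache)
            long after = usedHeap();

            // The values were allocated before the first snapshot, so the difference is
            // roughly the map itself, its table, the entry objects and the boxed keys.
            System.out.println("Approximate overhead for " + n + " entries: "
                    + (after - before) + " bytes (map size " + map.size() + ")");
        }
    }

Running it a few times, and with a larger n, helps smooth out allocation noise; it is still only a sanity check against the analytical estimate below.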


See the answer below; the actual estimate can be expressed as follows:

8 words per entry, plus 8 bytes for each long, plus 8 bytes for the hashmap object header.

In my present environment (a 32-bit OS) that makes 1 word = 4 bytes.

  • 40n + 8 in a 32-bit environment: ~40 KB for 1000 entries
  • 72n + 8 in a 64-bit environment: ~72 KB for 1000 entries

So it seems to be under 100 KB.
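A minimal sketch that simply encodes that estimate, with the word size (4 bytes on a 32-bit VM, 8 on a 64-bit VM, ignoring compressed oops) passed in as an assumed parameter; the method name is mine, and nothing here accounts for the table's load factor or alignment padding:

    public final class OverheadEstimate {

        // 8 words of entry bookkeeping per mapping, plus 8 bytes for the boxed long key,
        // plus 8 bytes for the HashMap object's own header.
        static long estimateBytes(long entries, int wordSizeBytes) {
            long perEntry = 8L * wordSizeBytes + 8;
            return perEntry * entries + 8;
        }

        public static void main(String[] args) {
            System.out.println(estimateBytes(1000, 4)); // 40008 -> ~40 KB on a 32-bit VM
            System.out.println(estimateBytes(1000, 8)); // 72008 -> ~72 KB on a 64-bit VM
        }
    }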

Solution

The following blog post provides some loose mathematics on the topic.
This Google Code site provides a look at how these things are done.

Quoting the links in case of link rot:

This is the cheat-sheet I compiled.

To compute the cost of a single (key, value) entry:

    If you use HashMap or ConcurrentHashMap, the cost is 8 words (32 bytes)


 So, consider this example from the javadoc:

    LoadingCache<Key, Graph> graphs = CacheBuilder.newBuilder()
        .maximumSize(10000)
        .expireAfterWrite(10, TimeUnit.MINUTES)
        .removalListener(MY_LISTENER)
        .build(
            new CacheLoader<Key, Graph>() {
              public Graph load(Key key) throws AnyException {
                return createExpensiveGraph(key);
              }
            });


The cost of an entry in this structure is computed as follows:

    It's a Cache: +12 words
    It uses maximumSize(): +4 words
    It uses expiration: +4 words

Thus, each (key, value) entry would have a footprint of 20 words (so 80 bytes in a 32-bit VM, or 160 bytes in a 64-bit one).

To estimate the overhead imposed in the garbage collector, one could count how many references (pointers) each entry introduces, which the garbage collector would have to traverse to compute object reachability. The same list again, this time only counting references:

    If you use HashMap or ConcurrentHashMap, the cost is 5 references
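As a hedged reading of where figures of that order come from (my own accounting, not the linked source): a classic pre-Java-8 HashMap entry carries a cached hash plus key, value and next references behind a two-word object header, and each mapping also consumes a slot in the bucket array, which the default 0.75 load factor keeps at more than one slot per entry on average. A simplified sketch with that bookkeeping in the comments:

    // Simplified mirror of a classic java.util.HashMap entry; the word counts in the
    // comments are rough 32-bit accounting of my own, not authoritative numbers.
    final class EntrySketch<K, V> {
        // object header:            ~2 words
        final int hash;            //  1 word   cached hash code
        final K key;               //  1 word   reference to the boxed Long key
        V value;                   //  1 word   reference to the stored data object
        EntrySketch<K, V> next;    //  1 word   collision-chain link

        EntrySketch(int hash, K key, V value, EntrySketch<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }
    // Roughly 6 words for the entry object itself, plus the entry's share of the bucket
    // array (more than 1 word on average at a 0.75 load factor), lands near the 8 words
    // per (key, value) pair quoted above; the key, value, next and bucket-slot references
    // are likewise most of the per-entry references the garbage collector has to chase.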
