Memory problems with Java in the context of Hadoop


Problem description



I want to compute a multi-way join in the Hadoop framework. When the number of records per relation grows beyond a threshold, I face two memory problems:

1) Error: GC overhead limit exceeded,

2) Error: Java heap space.

The threshold is 1,000,000 records per relation, for both a chain join and a star join.

In the join computation I use some hash tables, i.e.

Hashtable<V, LinkedList<K>> ht = new Hashtable<V, LinkedList<K>>(someSize, 0.75F);

At the moment these errors occur only while I hash the input records. During the hashing I have quite a few for loops which produce a lot of temporary objects; this is what causes problem 1). I solved problem 1) by setting K = StringBuilder, which is a final class. In other words, I reduced the number of temporary objects by keeping only a few objects whose value and content change, rather than creating new objects.
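
That reuse pattern looks roughly like the following sketch (the helper name, the input type and the field separator are my assumptions, not the actual join code):

import java.util.List;

// Sketch: build each record key with one reused StringBuilder instead of
// String concatenation, so the loop allocates far fewer temporary objects.
static String[] buildKeys(List<String[]> records) {
    StringBuilder kb = new StringBuilder();
    String[] keys = new String[records.size()];
    for (int i = 0; i < records.size(); i++) {
        kb.setLength(0);                  // reset the buffer, reuse the same object
        for (String field : records.get(i)) {
            kb.append(field).append('|'); // no intermediate String objects
        }
        keys[i] = kb.toString();          // one allocation per record
    }
    return keys;
}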

Now I am dealing with problem 2). I increased the heap space on each node of my cluster by setting the appropriate variable in the file $HADOOP_HOME/hadoop/conf/hadoop-env.sh. The problem still remained. I did some very basic monitoring of the heap using VisualVM. I monitored only the master node, in particular the JobTracker and the local TaskTracker daemons. I didn't notice any heap overflow during this monitoring, and the PermGen space didn't overflow either.

So for the moment, in the declaration,

Hashtable<V, LinkedList<K>> ht = new Hashtable<V, LinkedList<K>>(someSize, 0.75F);

I am thinking of setting V = SomeFinalClass. This SomeFinalClass would help me keep the number of objects, and consequently the memory usage, low. Of course, by default a SomeFinalClass object will have the same hash code regardless of its content, so I will not be able to use SomeFinalClass as a key in the hash table above. To solve this, I am thinking of overriding the default hashCode() method with one similar to String.hashCode(), which will produce a hash code based on the content of a SomeFinalClass object.
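
A minimal sketch of what such a class could look like, assuming a plain char[] content field (the field and the setter are hypothetical):

import java.util.Arrays;

// Sketch of the proposed SomeFinalClass: a final, reusable holder whose
// hashCode() is computed from its content, like String.hashCode().
public final class SomeFinalClass {
    private char[] content = new char[0];    // hypothetical content field

    public void setContent(char[] content) { // mutate in place instead of allocating
        this.content = content;
    }

    @Override
    public int hashCode() {                  // same 31-based scheme as String.hashCode()
        int h = 0;
        for (char c : content) {
            h = 31 * h + c;
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {        // must stay consistent with hashCode()
        return o instanceof SomeFinalClass
                && Arrays.equals(content, ((SomeFinalClass) o).content);
    }
}

One caveat with reusing mutable keys: if an object's content changes after it has been inserted as a key, it will hash to a different bucket and the old entry can no longer be looked up.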

What is your opinion on the problems and the solutions above? What would you do?

Should I also monitor the DataNode daemon? Are the errors above TaskTracker errors, DataNode errors, or both?

Finally, will the solutions above solve the problem for an arbitrary number of records per relation? Or will I run into the same problem again sooner or later?

Solution

Use an ArrayList instead of a LinkedList and it will use a lot less memory.

Also, I suggest using a HashMap instead of Hashtable, as the latter is a legacy class.
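
Applied to the declaration from the question, that suggestion would look roughly like the following sketch (buildIndex and add are hypothetical helpers; someSize is the same capacity hint as in the question):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: HashMap + ArrayList instead of the synchronized legacy Hashtable
// and the pointer-heavy LinkedList.
static <V, K> Map<V, List<K>> buildIndex(int someSize) {
    return new HashMap<V, List<K>>(someSize, 0.75F);
}

// Adding a record: create the ArrayList bucket lazily, then append to it.
static <V, K> void add(Map<V, List<K>> index, V key, K value) {
    List<K> bucket = index.get(key);
    if (bucket == null) {
        bucket = new ArrayList<K>();
        index.put(key, bucket);
    }
    bucket.add(value);
}

ArrayList keeps its elements in a single backing array, whereas every LinkedList element carries its own node object with extra references, which is where much of the memory overhead comes from.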
