Specifying memory limits with Hadoop


Problem description

I am trying to run a high-memory job on a Hadoop cluster (0.20.203). I modified the mapred-site.xml to enforce some memory limits.

  <property>
    <name>mapred.cluster.max.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapred.cluster.max.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapred.cluster.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapred.cluster.reduce.memory.mb</name>
    <value>2048</value>
  </property>

In my job, I am specifying how much memory I will need. Unfortunately, even though I am running my process with -Xmx2g (the job runs just fine with this much memory as a console application), I need to request much more memory for my mapper (as a subquestion, why is this?) or it is killed.

  import org.apache.hadoop.conf.Configuration

  // Request 4 GB slots for map tasks and 1 GB for reduce tasks; each child JVM gets a 2 GB max heap.
  val conf = new Configuration()
  conf.set("mapred.child.java.opts", "-Xms256m -Xmx2g -XX:+UseSerialGC")
  conf.set("mapred.job.map.memory.mb", "4096")
  conf.set("mapred.job.reduce.memory.mb", "1024")

The reducer needs hardly any memory since I am performing an identity reducer.

  import org.apache.hadoop.mapreduce.Reducer
  import scala.collection.JavaConversions._ // lets the for-comprehension iterate over a java.lang.Iterable

  class IdentityReducer[K, V] extends Reducer[K, V, K, V] {
    override def reduce(key: K,
        values: java.lang.Iterable[V],
        context: Reducer[K, V, K, V]#Context) {
      // Pass every value straight through to the output unchanged.
      for (v <- values) {
        context.write(key, v)
      }
    }
  }

However, the reducer is still using a lot of memory. Is it possible to give the reducer different JVM arguments than the mapper? Hadoop kills the reducer and claims it is using 3960 MB of memory! And the reducers end up failing the job. How is this possible?

TaskTree [pid=10282,tipID=attempt_201111041418_0005_r_000000_0] is running beyond memory-limits.
Current usage : 4152717312bytes.
Limit : 1073741824bytes.
Killing task.

UPDATE: even when I specify a streaming job with cat as the mapper and uniq as the reducer and -Xms512M -Xmx1g -XX:+UseSerialGC my tasks take over 2g of virtual memory! This seems extravagant at 4x the max heap size.

TaskTree [pid=3101,tipID=attempt_201111041418_0112_m_000000_0] is running beyond memory-limits.
Current usage : 2186784768bytes.
Limit : 2147483648bytes.
Killing task.

Update: the original JIRA for changing the configuration format for memory usage specifically mentions that Java users are mostly interested in physical memory to prevent thrashing. I think this is exactly what I want: I don't want a node to spin up a mapper if there is inadequate physical memory available. However, these options all seem to have been implemented as virtual memory constraints, which are difficult to manage.

Solution

Check your ulimit. The following is from Cloudera and refers to version 0.20.2, but a similar issue probably applies to later versions:

...if you set mapred.child.ulimit, it's important that it must be more than two times the heap size value set in mapred.child.java.opts. For example, if you set a 1G heap, set mapred.child.ulimit to 2.5GB. Child processes are now guaranteed to fork at least once, and the fork momentarily requires twice the overhead in virtual memory.

It's also possible that setting mapred.child.java.opts programmatically is "too late"; you might want to verify it really is going into effect, and put it in your mapred-site.xml if not.
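As a concrete illustration of the quoted advice, here is a minimal mapred-site.xml sketch pairing a 1 GB child heap with a ulimit of about 2.5 GB. It assumes mapred.child.ulimit is expressed in kilobytes, as in the 0.20 defaults; adjust the numbers to your own heap size.

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <!-- about 2.5 GB in KB, i.e. more than twice the 1 GB heap above -->
    <value>2621440</value>
  </property>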
