Reuse JVM in Hadoop MapReduce jobs

Problem description

I know we can set the property "mapred.job.reuse.jvm.num.tasks" to reuse JVMs. My questions are:

(1) How do I decide the number of tasks to set here: -1 or some other positive integer?

(2) Is it a good idea to reuse JVMs by setting this property to -1 in MapReduce jobs?

Thank you very much!

Solution

If you have very small tasks that definitely run one after another, it is useful to set this property to -1 (meaning that a spawned JVM will be reused an unlimited number of times). You then spawn only (number of task slots in your cluster available to your job) JVMs instead of (number of tasks) JVMs.
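As a concrete illustration, here is a minimal driver sketch using the classic org.apache.hadoop.mapred API that this property belongs to. The class name, job name, identity mapper/reducer and input/output paths are placeholders, not part of the original answer; only the property name and the -1 value come from the question and answer above.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class JvmReuseDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JvmReuseDriver.class);
        conf.setJobName("many-small-tasks");

        // -1: every spawned JVM is reused for an unlimited number of tasks of
        // this job; the default of 1 starts a fresh JVM for every task.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        // Identity mapper/reducer just to keep the sketch self-contained;
        // with the default TextInputFormat the key/value types are LongWritable/Text.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

If a driver goes through ToolRunner/GenericOptionsParser, the same property can also be supplied per run on the command line with -D mapred.job.reuse.jvm.num.tasks=-1.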

For such workloads this is a huge performance improvement. In long-running jobs, on the other hand, the time spent setting up a new JVM is only a tiny fraction of the total runtime, so JVM reuse doesn't give you a big boost there.

Also, for long-running tasks it is good to recreate the task process, because issues like heap fragmentation would otherwise degrade your performance.

In addition, if you have jobs whose tasks run for a medium amount of time, you could reuse the JVM for just 2-3 tasks, which gives a good trade-off.
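To make that trade-off explicit in code, here is a small helper sketch; the class and method names are hypothetical and not part of Hadoop, only the property name is taken from the question.

```java
import org.apache.hadoop.mapred.JobConf;

// Hypothetical helper that centralizes the JVM-reuse trade-off discussed above.
public final class JvmReusePolicy {

    private JvmReusePolicy() {}

    /** Many tiny tasks: reuse each task JVM an unlimited number of times. */
    public static void unlimitedReuse(JobConf conf) {
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    }

    /** Medium-length tasks: reuse each task JVM for a few tasks, e.g. 2-3. */
    public static void moderateReuse(JobConf conf, int tasksPerJvm) {
        conf.setInt("mapred.job.reuse.jvm.num.tasks", tasksPerJvm);
    }
}
```

A job with medium-length tasks would then call JvmReusePolicy.moderateReuse(conf, 3) in its driver before submitting.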
