How to specify mapred configurations & java options with custom jar in CLI using Amazon's EMR?


Question




I would like to know how to specify mapreduce configurations such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using a custom jar.

We can specify these configurations as follows when running with an external scripting language like ruby or python:

ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output

I tried the following ways, but none of them worked:

  1. ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

  2. ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -jobconf,mapred.min.split.size=52880 -jobconf,mapred.task.timeout=0

I would also like to know how to pass java options to a streaming job using a custom jar in EMR. When running locally on hadoop we can pass them as follows:

bin/hadoop jar job.jar input_path output_path -D<some_java_parameter>=<some_value>

Solution

I believe if you want to set these on a per-job basis, then you need to

A) for custom Jars, pass them into your jar as arguments, and process them yourself. I believe this can be automated as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // Strips generic options (-D, -files, -libjars, ...) into conf and
  // returns only the application-specific arguments.
  args = new GenericOptionsParser(conf, args).getRemainingArgs();
  //....
}

Then create the job in this manner (though I haven't verified that it works):

 > elastic-mapreduce --jar s3://mybucket/mycode.jar \
    --args "-D,mapred.reduce.tasks=0" \
    --arg s3://mybucket/input \
    --arg s3://mybucket/output

The GenericOptionsParser should automatically transfer the -D and -jobconf parameters into Hadoop's job setup. More details: http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/util/GenericOptionsParser.html
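If you would rather not depend on GenericOptionsParser, the same -D handling can be reproduced by hand inside the custom jar. The sketch below is stdlib-only and the class and method names are mine, not part of Hadoop; it only illustrates how -Dkey=value pairs are split out of the argument list before the remaining arguments reach your job:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): mimics how generic -D options
// are separated from application arguments.
public class DashDParser {
    public final Map<String, String> props = new LinkedHashMap<>();
    public final List<String> remaining = new ArrayList<>();

    public DashDParser(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String a = args[i];
            if (a.equals("-D") && i + 1 < args.length) {
                addProp(args[++i]);          // "-D key=value" form
            } else if (a.startsWith("-D")) {
                addProp(a.substring(2));     // "-Dkey=value" form
            } else {
                remaining.add(a);            // application argument (e.g. paths)
            }
        }
    }

    private void addProp(String kv) {
        int eq = kv.indexOf('=');
        if (eq > 0) {
            props.put(kv.substring(0, eq), kv.substring(eq + 1));
        }
    }

    public static void main(String[] args) {
        DashDParser p = new DashDParser(new String[] {
            "-D", "mapred.reduce.tasks=0",
            "s3://mybucket/input", "s3://mybucket/output"
        });
        System.out.println(p.props.get("mapred.reduce.tasks")); // 0
        System.out.println(p.remaining);
    }
}
```

In a real job you would copy the collected pairs into the Configuration with conf.set(key, value) before submitting.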

B) for the hadoop streaming jar, you also just pass the configuration change on the command line:

> elastic-mapreduce --jobflow j-ABABABABA \
   --stream --jobconf mapred.task.timeout=600000 \
   --mapper s3://mybucket/mymapper.sh \
   --reducer s3://mybucket/myreducer.sh \
   --input s3://mybucket/input \
   --output s3://mybucket/output \
   --jobconf mapred.reduce.tasks=0

More details: https://forums.aws.amazon.com/thread.jspa?threadID=43872 and elastic-mapreduce --help
