Trying to run Spark on EMR using the AWS SDK for Java, but it skips the remote JAR stored on S3


Problem description


I'm trying to run Spark on EMR using the SDK for Java, but I'm having issues getting the spark-submit to use a JAR that I have stored on S3. Here is the relevant code:

public String launchCluster() throws Exception {
    StepFactory stepFactory = new StepFactory();

    // Creates a cluster flow step for debugging
    StepConfig enableDebugging = new StepConfig().withName("Enable debugging")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

    // Here is the original code before I tried command-runner.jar. 
    // When using this, I get a ClassNotFoundException for 
    // org.apache.spark.SparkConf. This is because for some reason, 
    // the super-jar that I'm generating doesn't include apache spark. 
    // Even so, I believe EMR should already have Spark installed if
    // I configure this correctly...

    //        HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
    //                .withJar(JAR_LOCATION)
    //                .withMainClass(MAIN_CLASS);

    HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
            .withJar("command-runner.jar")
            .withArgs(
                    "spark-submit",
                    "--master", "yarn",
                    "--deploy-mode", "cluster",
                    "--class", SOME_MAIN_CLASS,
                    SOME_S3_PATH_TO_SUPERJAR,
                    "-useSparkLocal", "false"
            );

    StepConfig customExampleStep = new StepConfig().withName("Example Step")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(runExampleConfig);

    // Create Applications so that the request knows to launch
    // the cluster with support for Hadoop and Spark.

    // Unsure if Hadoop is necessary...
    Application hadoopApp = new Application().withName("Hadoop");
    Application sparkApp = new Application().withName("Spark");

    RunJobFlowRequest request = new RunJobFlowRequest().withName("spark-cluster")
            .withReleaseLabel("emr-5.15.0")
            .withSteps(enableDebugging, customExampleStep)
            .withApplications(hadoopApp, sparkApp)
            .withLogUri(LOG_URI)
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withVisibleToAllUsers(true)
            .withInstances(new JobFlowInstancesConfig()
                    .withInstanceCount(3)
                    .withKeepJobFlowAliveWhenNoSteps(true)
                    .withMasterInstanceType("m3.xlarge")
                    .withSlaveInstanceType("m3.xlarge")
            );
    // Submit the job flow; emr is assumed to be an AmazonElasticMapReduce client
    // created elsewhere (e.g. via AmazonElasticMapReduceClientBuilder).
    RunJobFlowResult result = emr.runJobFlow(request);
    return result.getJobFlowId();
}
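
For reference, a minimal sketch of how the method above could be driven, assuming the emr client used by the final runJobFlow call is a standard AWS SDK for Java v1 client; the region and the surrounding wiring here are assumptions for illustration, not part of the original question:

import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;

// Build the EMR client that launchCluster() submits the RunJobFlowRequest through.
// Credentials come from the default provider chain; the region is an assumption.
AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
        .withRegion(Regions.US_WEST_2)
        .build();

// Launch the cluster and print the resulting job flow (cluster) ID.
String jobFlowId = launchCluster();
System.out.println("Launched cluster: " + jobFlowId);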


The steps complete without error, but it doesn't actually output anything... When I check the logs, stderr includes the following:
Warning: Skip remote jar s3://somebucket/myservice-1.0-super.jar.
and
18/07/17 22:08:31 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


I'm not sure what the issue is based on the log. I believe I am installing Spark correctly on the cluster. Also, to give some context - when I use withJar directly with the super-JAR stored on S3 instead of command-runner (and without withArgs), it correctly grabs the JAR, but then it doesn't have Spark installed - I get a ClassNotFoundException for SparkConf (and JavaSparkContext, depending on what my Spark job code tries to create first).
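
For context, the direct-JAR variant described above corresponds to the commented-out block in the posted code; roughly, using the question's own JAR_LOCATION and MAIN_CLASS placeholders:

// Runs the super-JAR's main class through the Hadoop JAR runner rather than spark-submit.
// Unless Spark is bundled into the JAR, SparkConf/JavaSparkContext are not on the
// classpath here, which matches the ClassNotFoundException described above.
HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
        .withJar(JAR_LOCATION)      // s3:// path to the super-JAR
        .withMainClass(MAIN_CLASS);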


Any pointers would be much appreciated!

Answer


I think that if you are using a recent EMR release (emr-5.17.0 for instance), the --master parameter should be yarn-cluster instead of yarn in the runExampleConfig statement. I had the same problem and, after this change, it works fine for me.
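
A minimal sketch of the suggested change, applied to the step configuration from the question; only the --master value differs from the original post:

HadoopJarStepConfig runExampleConfig = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs(
                "spark-submit",
                "--master", "yarn-cluster",   // changed from "yarn", as suggested above
                "--deploy-mode", "cluster",
                "--class", SOME_MAIN_CLASS,
                SOME_S3_PATH_TO_SUPERJAR,
                "-useSparkLocal", "false"
        );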

