How to run a Spark Java program


Problem description

I have written a Java program for Spark. But how do I compile and run it from the Unix command line? Do I have to include any JARs when compiling or running it?

Recommended answer

From the official Quick Start guide and Launching Spark on YARN we get:

We’ll create a very simple Spark application, SimpleApp.java:

/*** SimpleApp.java ***/
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala example, we initialize a SparkContext, though we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend spark.api.java.function.Function. The Java programming guide describes these differences in more detail.

To build the program, we also write a Maven pom.xml file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.0-incubating</version>
    </dependency>
  </dependencies>
</project>

If you also wish to read data from Hadoop’s HDFS, you will need to add a dependency on hadoop-client for your version of HDFS:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>
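
(An aside, not from the original answer.) If you are not sure which version to put there, you can check which Hadoop version your cluster runs from the command line and use that value in the hadoop-client dependency above:

$ hadoop version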

We lay out these files according to the canonical Maven directory structure:

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now, we can execute the application using Maven:

$ mvn package
$ mvn exec:java -Dexec.mainClass="SimpleApp"
...
Lines with a: 46, Lines with b: 23
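
To address the "do I have to include any jar" part of the question directly, here is a hedged sketch of running the packaged application with plain java instead of mvn exec:java. The assembly JAR name is an assumption taken from the YARN build step further down; adjust the path to match your Spark installation.

# Sketch only: assumes "mvn package" has been run (producing target/simple-project-1.0.jar)
# and that the Spark assembly JAR has been built (see the sbt/sbt assembly step below).
$ java -cp target/simple-project-1.0.jar:$YOUR_SPARK_HOME/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
    SimpleApp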

Then follow the steps from Launching Spark on YARN:

Building a YARN-Enabled Assembly JAR

We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster. This can be built by setting the Hadoop version and SPARK_YARN environment variable, as follows:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

The assembled JAR will be something like this: ./assembly/target/scala-2.10/spark-assembly_0.9.0-incubating-hadoop2.0.5.jar.

The build process now also supports new YARN versions (2.2.x). See below.
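
As a hedged example (the exact version number is an assumption; check the Building Spark documentation for your release), building against Hadoop/YARN 2.2.x should use the same environment variables with the newer version:

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly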

Preparations


  • Building a YARN-enabled assembly (see above).
  • The assembled JAR can be installed into HDFS or used locally.
  • Your application code must be packaged into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A spark-examples_2.10-0.9.0-incubating file can be generated by running:

sbt/sbt assembly 

NOTE: since the documentation you’re reading is for Spark version 0.9.0-incubating, we are assuming here that you have downloaded Spark 0.9.0-incubating or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.

Configuration

Most of the configs are the same for Spark on YARN as for other deployment modes. See the Configuration page for more information on those. The following configs are specific to Spark on YARN.

Environment variables:


  • SPARK_YARN_USER_ENV, to add environment variables to the Spark processes launched on YARN. This can be a comma separated list of environment variables, e.g.
SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar"

System properties:


  • spark.yarn.applicationMaster.waitTries, property to set the number of times the ApplicationMaster waits for the Spark master and then also the number of tries it waits for the Spark context to be initialized. Default is 10.
  • spark.yarn.submit.file.replication, the HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark JAR, the app JAR, and any distributed cache files/archives.
  • spark.yarn.preserve.staging.files, set to true to preserve the staged files (Spark JAR, app JAR, distributed cache files) at the end of the job rather than delete them.
  • spark.yarn.scheduler.heartbeat.interval-ms, the interval in ms at which the Spark application master heartbeats into the YARN ResourceManager. Default is 5 seconds.
  • spark.yarn.max.worker.failures, the maximum number of worker failures before failing the application. Default is the number of workers requested times 2, with a minimum of 3.
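
A hedged sketch of how these properties can be supplied: assuming bin/spark-class in this Spark version picks up JVM options from SPARK_JAVA_OPTS (an assumption, not stated in the original answer), they can be passed as -D flags when launching the YARN Client shown below.

# Assumption: SPARK_JAVA_OPTS is honored by bin/spark-class; set it before running the
# Client command in the next section so the properties reach the client JVM.
export SPARK_JAVA_OPTS="-Dspark.yarn.preserve.staging.files=true -Dspark.yarn.submit.file.replication=3"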

Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster. These are used to connect to the cluster, write to HDFS, and submit jobs to the ResourceManager.
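
For example (the path below is hypothetical; point it at wherever your cluster’s client-side configs live):

export HADOOP_CONF_DIR=/etc/hadoop/conf   # hypothetical location of the Hadoop client config files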

There are two scheduler modes that can be used to launch a Spark application on YARN.

Launch a Spark application via the YARN Client in yarn-standalone mode.

The command to launch the YARN Client is as follows:

SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES> \
  --master-class <ApplicationMaster_CLASS> \
  --master-memory <MEMORY_FOR_MASTER> \
  --worker-memory <MEMORY_PER_WORKER> \
  --worker-cores <CORES_PER_WORKER> \
  --name <application_name> \
  --queue <queue_name> \
  --addJars <any_local_files_used_in_SparkContext.addJar> \
  --files <files_for_distributed_cache> \
  --archives <archives_for_distributed_cache>

For example:

# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties

# Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

# Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
# (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
Pi is roughly 3.13794

The above starts a YARN Client program which starts the default Application Master. SparkPi will then be run as a child thread of the Application Master; the YARN Client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running.

With this mode, your application actually runs on the remote machine where the Application Master runs. Thus, applications that involve local interaction, e.g. spark-shell, will not work well.
