Compiling Spark Scala Program into jar file using installed spark and maven


Problem description


I'm still trying to get familiar with Maven and with compiling my source code into jar files for spark-submit. I know how to do this with IntelliJ, but I would like to understand how it actually works. I have an EC2 server with all of the latest software, such as Spark and Scala, already installed, and I have the example SparkPi.scala source code that I would now like to compile with Maven. My questions are: first, can I just use my installed software to build the code rather than retrieving dependencies from the Maven repository, and how do I start off with a basic pom.xml template and add the appropriate requirements? I don't fully understand what Maven is actually doing, and how can I simply test a compilation of my source code? As I understand it, I just need the standard directory structure src/main/scala and can then run mvn package. Also, I would like to test with Maven rather than sbt.
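For reference, the standard Maven layout the question refers to, and the command to package it, would look roughly like this (an illustrative sketch; the project and file names are placeholders):

my-spark-app/
    pom.xml
    src/
        main/
            scala/
                SparkPi.scala

# run from the project root; the packaged jar ends up under target/
mvn package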

Recommended answer


In addition to @Krishna's answer: if you have a Maven project, run mvn clean package against the pom.xml. Make sure you have the following build section in your pom.xml to produce a fat jar (this is my case, how I'm making the jar):

<build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>assemble-all</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
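Note that the build section above only configures the Java compiler and the assembly plugin; for Maven to compile .scala sources, a Scala plugin and the Spark dependency are also needed. The following is a sketch of what could be added to the same pom.xml, assuming a Spark 1.6 / Scala 2.10 setup (the versions and the _2.10 suffix are assumptions and should match your installed Spark and Scala):

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0</version>
        <!-- provided: the cluster already ships Spark, so it stays out of the fat jar -->
        <scope>provided</scope>
    </dependency>
</dependencies>

<!-- add inside <plugins>: compiles the Scala sources during the build -->
<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <goals>
                <goal>add-source</goal>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>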


For more detail: link. If you have an sbt project, use sbt clean assembly to make the fat jar. For that you need the following config, as an example, in build.sbt:

assemblyJarName := "WordCountSimple.jar"

// regex matching META-INF entries, which are discarded below to avoid merge conflicts
val meta = """META.INF(.)*""".r

assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs@_*) => MergeStrategy.first
  case PathList(ps@_*) if ps.last endsWith ".html" => MergeStrategy.first
  case n if n.startsWith("reference.conf") => MergeStrategy.concat
  case n if n.endsWith(".conf") => MergeStrategy.concat
  case meta(_) => MergeStrategy.discard
  case x => MergeStrategy.first
}

Also, in plugin.sbt, something like:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
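For completeness on the sbt side, the Spark dependency itself also has to be declared in build.sbt; a minimal sketch, again assuming a Spark 1.6 / Scala 2.10 setup (names and versions are illustrative):

name := "WordCountSimple"

scalaVersion := "2.10.6"

// "provided" keeps Spark out of the fat jar, since the cluster supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"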


For more see this and this.


Up to this point, the main goal is to get a fat jar with all dependencies in the target folder. Use that jar to run on the cluster like this:

hastimal@nm:/usr/local/spark$ ./bin/spark-submit --class  com.hastimal.wordcount --master yarn-cluster  --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g  --driver-cores 7 --conf spark.default.parallelism=105 --conf spark.driver.maxResultSize=4g --conf spark.network.timeout=300  --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200  --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/wordcount.jar inputRDF/data_all.txt /output 


Here inputRDF/data_all.txt and /output are the two arguments. Also, from a tooling point of view, I'm building in IntelliJ as the IDE.
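The submitted class com.hastimal.wordcount is not shown in the answer; purely as an illustration, a hypothetical minimal version of such a driver (object name, package, and logic are assumptions) could look like this:

package com.hastimal

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical word-count driver: args(0) is the input path and args(1) the output path,
// matching the two arguments passed on the spark-submit command line above.
object wordcount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountSimple")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))

    sc.stop()
  }
}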
