How to compile/package Spark 2.0 project with external jars and Maven


Problem description

Since version 2.0, Apache Spark is bundled with a folder "jars" full of .jar files. Obviously Maven will download all these jars when issuing:

mvn -e package

because in order to submit an application with

spark-submit --class DataFetch target/DataFetch-1.0-SNAPSHOT.jar

the DataFetch-1.0-SNAPSHOT.jar is needed.

So, the first question is straightforward: how can I take advantage of these existing jars? The second question is related: when I first tried it with Maven downloading the jars, I got the following output:

[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building "DataFetch" 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @DataFetch ---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory     /root/sparkTests/scalaScripts/DataFetch/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.0:compile (default-compile) @ DataFetch ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-resources-plugin:2.5:testResources (default-testResources) @ DataFetch ---
[debug] execute contextualize
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /root/sparkTests/scalaScripts/DataFetch/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.0:testCompile (default-testCompile) @ DataFetch ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.10:test (default-test) @ DataFetch ---
[INFO] No tests to run.
[INFO] Surefire report directory: /root/sparkTests/scalaScripts/DataFetch/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ DataFetch ---
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.294s
[INFO] Finished at: Wed Sep 28 17:41:29 PYT 2016
[INFO] Final Memory: 14M/71M
[INFO] ------------------------------------------------------------------------

And here is my pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.spark.pg</groupId>
  <artifactId>DataFetch</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>"DataFetch"</name>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies>

     <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
            </plugin>
        </plugins>
    </build>  

</project>

If more information is needed, please don't hesitate to ask for it.

Solution

I am not sure whether I understand your problem, but I will try to answer.

Based on the Spark "Bundling Your Application's Dependencies" documentation:

When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.

You can set the scope to provided in your Maven pom.xml file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${spark.version}</version>
    <!-- add this scope -->
    <scope>provided</scope>
</dependency>
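
Note that the snippet above references ${spark.version}, which is not defined in the question's pom.xml. As a minimal sketch (assuming you want to keep the 2.0.0 version already declared in the question's dependency), you can define it as a property:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- assumed value: matches the Spark version already used in the question's pom -->
    <spark.version>2.0.0</spark.version>
</properties>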

The second thing I noticed is that the Maven build creates an empty JAR.

[WARNING] JAR will be empty - no content was marked for inclusion!

If you have any other dependencies, you should package these dependencies into the final jar archive file.

You can add something like the following to your pom.xml and then run mvn package:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.6</version>
        <configuration>
            <!-- package with project dependencies -->
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
                <manifest>
                    <mainClass>YOUR_MAIN_CLASS</mainClass>
                </manifest>
            </archive>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>

The Maven log should print a line showing the jar being built:

[INFO] --- maven-assembly-plugin:2.4.1:single (make-assembly) @ dateUtils ---
[INFO] Building jar: path/target/APPLICATION_NAME-jar-with-dependencies.jar

After the Maven package phase, you should see DataFetch-1.0-SNAPSHOT-jar-with-dependencies.jar in the target folder, and you can submit this jar with spark-submit.
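
For example, the submit command from the question would then point at the assembled jar (assuming the same main class name as before):

spark-submit --class DataFetch target/DataFetch-1.0-SNAPSHOT-jar-with-dependencies.jar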
