Build Spark Uber jar in Maven instead of multiple Uber jars (one per module)


Problem description

I've written a script in Spark/Scala to process a large graph, and I can compile and run it in IntelliJ 14 within the Spark source-code project (downloaded version 1.2.1). What I'm trying to do now is build an uber jar to create a single executable file I can upload to EC2 and run. I'm aware of the plugins that are supposed to create the fat jar for the project, but I can't figure out how to do this - both plugins just create 'uber' jars for each module rather than a single main jar.

To be clear: I have tried both the Maven Assembly and Maven Shade plugins, and each time the build produces 10 main jars (called 'jar-with-dependencies' or 'uber' respectively) rather than one main jar. It creates an uber jar for core_2.10, another for streaming_2.10, another for graphx_2.10, and so on.

I have tried altering the settings and configurations of the Maven plugins. For example, I tried adding this to the Shade plugin:

<configuration>
  <shadedArtifactAttached>false</shadedArtifactAttached>
  <artifactSet>
    <includes>
      <include>org.spark-project.spark:unused</include>
    </includes>
  </artifactSet>
</configuration>
<executions>
  <execution>
    <phase>package</phase>
    <goals>
      <goal>shade</goal>
    </goals>
  </execution>
</executions>

I've also tried the alternative Maven Assembly plugin:

<configuration>
  <descriptorRefs>
    <descriptorRef>jar-with-dependencies</descriptorRef>
  </descriptorRefs>
  <archive>
    <manifest>
      <mainClass>org.apache.spark.examples.graphx.PageRankGraphX</mainClass>
    </manifest>
  </archive>
</configuration>
<executions>
  <execution>
    <id>make-assembly</id>
    <phase>package</phase>
    <goals>
      <goal>single</goal>
    </goals>
  </execution>
</executions>

I would also point out that I've tried a number of variations on the plugin settings available online, but none of them has worked. It's fairly obvious that something is wrong with the project set-up; however, this isn't my own project - it's a source-code installation of Apache Spark, so I have no idea why it would be so difficult to build.

I am creating the build with the command line:

mvn package -DskipTests

Any help would be greatly appreciated.

Further investigation shows that many of the Spark module dependencies in the final module are set to 'provided' in the pom (that would be org.spark.graphx, org.spark.streaming, org.spark.mllib, etc.). However, running the jar for this 'final' module (the examples module) fails to find classes in those modules (i.e. those dependencies). Perhaps someone with more experience knows what this means.
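
A 'provided' dependency is compiled against but not packaged, on the assumption that the runtime (here, the cluster the application is submitted to) supplies those classes - which is why running the examples jar on its own cannot find them. One rough way to confirm which Spark artifacts sit in that scope (a sketch; the module name examples is an assumption based on the Spark 1.2.x source layout):

# From the root of the Spark source tree: print the examples module's
# dependency tree and keep only the entries resolved in 'provided' scope.
mvn -pl examples dependency:tree | grep ':provided'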

Recommended answer

You are looking for the product of mvn package in the assembly module. You do not need to add to or modify the build.
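
In other words, running the normal build from the root of the source tree is enough; the single assembled jar is the output of the assembly module. A minimal sketch of what that looks like (the exact jar name depends on the Spark and Hadoop versions, so the path below is an assumption for a 1.2.x build):

# Build the whole project (skipping tests), then pick up the single
# assembled jar produced by the assembly module.
mvn -DskipTests package
ls assembly/target/scala-2.10/spark-assembly-*.jar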

However, bundling an uber jar may not be the right way to set up and run a cluster on EC2. There is a script in the ec2 directory for spinning up a cluster, and then you generally spark-submit your application (which includes no Spark/Hadoop classes) on the cluster.
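
The workflow might look roughly like the following; the key pair, cluster name, master URL and application jar name are placeholders, and the main class is the one from the question (a sketch under those assumptions, not a verified recipe):

# Launch a small cluster with the bundled EC2 script (Spark 1.2.x layout).
./ec2/spark-ec2 -k my-keypair -i my-keypair.pem -s 2 launch my-cluster

# Then submit the application jar, which should contain only your own code;
# Spark/Hadoop classes are 'provided' by the cluster at runtime.
./bin/spark-submit \
  --class org.apache.spark.examples.graphx.PageRankGraphX \
  --master spark://<master-hostname>:7077 \
  my-graph-app.jar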

