Does spark-submit automatically upload the jar to the cluster?


Problem Description


I'm trying to submit a Spark app from my local machine's terminal to my cluster. I'm using --master yarn-cluster. I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine.

When I provide the path to the application jar, which is on my local machine, will spark-submit automatically upload it to my cluster?

I'm using

    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster --executor-memory 100m \
      --num-executors 50 /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar \
      1000

and I get the error:

    Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101- does not exist

In the documentation, http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit, it says:

Advanced Dependency Management: When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.

But it seems like it does not!

Solution

I see you are quoting the spark-submit page from the Spark docs, but I would spend a lot more time on the Running Spark on YARN page. Bottom line, look at:

There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Further, you note: "I need to run the driver program on my cluster too, not on the machine from which I submit the application, i.e. my local machine."

So I agree with you that you are right to run --master yarn-cluster instead of --master yarn-client.
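
For comparison, here is a minimal sketch of the two invocations, reusing your class and jar path; everything else is illustrative:

    # yarn-client: the driver runs on the submitting machine (your laptop)
    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-client \
      --executor-memory 100m --num-executors 50 \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000

    # yarn-cluster: the driver runs inside the YARN ApplicationMaster on the cluster
    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster \
      --executor-memory 100m --num-executors 50 \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000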

(And one comment notes what might just be a syntax error where you dropped "assembly.jar", but I think this applies as well...)

Some of the basic assumptions about non-YARN implementations change a lot when YARN is introduced, mostly related to classpaths and the need to push jars to the workers.

From an email on the Apache Spark User list:

YARN cluster mode. Spark submit does upload your jars to the cluster. In particular, it puts the jars in HDFS so your driver can just read from there. As in other deployments, the executors pull the jars from the driver.
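
If you want to see that upload happen, one way to check (assuming default settings, where Spark on YARN stages files under .sparkStaging in your HDFS home directory) is:

    # List the staging directory Spark created for the application.
    # The application id below is illustrative; take the real one from the spark-submit output.
    hdfs dfs -ls /user/$(whoami)/.sparkStaging/application_1450000000000_0001/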

So finally, from the Apache Spark YARN doc:

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.
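
Concretely, on the submitting machine that usually looks something like the sketch below; the config path is an assumption, so point it at wherever your cluster's client-side Hadoop configs actually live:

    # Point spark-submit at the client-side Hadoop/YARN configuration
    export HADOOP_CONF_DIR=/etc/hadoop/conf   # hypothetical path; YARN_CONF_DIR works as well
    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster --executor-memory 100m \
      --num-executors 50 \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000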


NOTE: I only see you adding a single JAR; if there's a need to add other JARs, there's a special note about doing that with YARN:

In yarn-cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.

That page in the link has some examples.
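
As a rough sketch (the extra jar names are made up for illustration), that looks like:

    # Additional jars listed with --jars are shipped to the cluster along with the application jar
    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster \
      --jars /Users/nish1013/libs/dep-one.jar,/Users/nish1013/libs/dep-two.jar \
      /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000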


And of course you downloaded or built the YARN-specific version of Spark.


Background: in a standalone cluster deployment using spark-submit and the option --deploy-mode cluster, yes, you do need to make sure every worker node has access to all the dependencies; Spark will not push them to the cluster. This is because in "standalone cluster" mode, with Spark as the job manager, you don't know which node the driver will run on! But that doesn't apply to your case.
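
For contrast only, since it does not apply here, a standalone-cluster submit would look roughly like this (the master URL and shared path are placeholders), and the jar path would have to resolve on every node:

    # Standalone cluster mode: the driver is launched on one of the workers,
    # so the jar must already be reachable from every node (shared path or hdfs:// URL)
    bin/spark-submit \
      --class com.my.application.XApp \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      /shared/jars/x-service-1.0.0-201512141101-assembly.jar 1000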

But if I could, depending on the size of the jars you are uploading, I would still explicitly put the jars on each node, or make them "globally available" via HDFS, for another reason from the docs:

The Advanced Dependency Management section seems to present the best of both worlds, but also a great reason for manually pushing your jars out to all nodes:

local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

But I assume that local:/... would change to hdfs://...; not sure on that one.
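
A rough sketch of both approaches, with the HDFS directory and node-local path being assumptions:

    # Option 1: stage the jar in HDFS yourself, then submit with an hdfs:// URL
    hdfs dfs -mkdir -p /apps/x-service
    hdfs dfs -put /Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar /apps/x-service/
    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster \
      hdfs:///apps/x-service/x-service-1.0.0-201512141101-assembly.jar 1000

    # Option 2: copy the jar to the same path on every node, then reference it with local:/
    # (no network IO at submit time, but the file must already exist on each node)
    bin/spark-submit \
      --class com.my.application.XApp \
      --master yarn-cluster \
      local:/opt/jars/x-service-1.0.0-201512141101-assembly.jar 1000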
