Spark submit YARN mode HADOOP_CONF_DIR contents


Problem description

I am trying to launch a Spark task on a Hadoop cluster using spark-submit in YARN mode.

I am launching spark-submit from my development machine.

According to the Running Spark On YARN docs, I am supposed to provide the path to the Hadoop cluster configuration via the env var HADOOP_CONF_DIR or YARN_CONF_DIR. This is where it gets tricky: why must these folders exist on my local machine if I am sending the task to a remote YARN service? Does this mean that spark-submit must be located inside the cluster, and therefore I cannot launch a Spark task remotely? If not, what should I populate these folders with? Should I copy the Hadoop configuration folder from the YARN cluster node where the task manager service resides?

Solution

1) When submitting a job, Spark needs to know what it is connecting to. The files are parsed and the required configuration is used to connect to the Hadoop cluster. Note that the documentation says (right in the first sentence) that this is client-side configuration, meaning you do not actually need every configuration of the cluster in those files. To connect to a non-secured Hadoop cluster with a minimal configuration, you will need at least the following settings present:

  • fs.defaultFS (in case you intend to read from HDFS)
  • dfs.nameservices
  • yarn.resourcemanager.hostname or yarn.resourcemanager.address
  • yarn.application.classpath
  • (others might be required, depending on the configuration)

You can avoid having the files by setting the same settings in the code of the job you are submitting:

SparkConf sparkConfiguration = new SparkConf();
sparkConfiguration.set("spark.hadoop.fs.defaultFS", "...");
...
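
As a slightly fuller sketch of the same idea (the hostnames, ports and classpath value below are placeholders, not values from the original answer, and the exact keys depend on your cluster), the settings listed above can be passed to Hadoop through Spark's spark.hadoop.* prefix:

import org.apache.spark.SparkConf;

// Hedged example: every host, port and path here is a placeholder.
SparkConf sparkConfiguration = new SparkConf()
    .setAppName("remote-yarn-example")
    // HDFS entry point (only needed if the job reads from or writes to HDFS)
    .set("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
    // Where the YARN ResourceManager accepts application submissions
    .set("spark.hadoop.yarn.resourcemanager.hostname", "rm.example.com")
    .set("spark.hadoop.yarn.resourcemanager.address", "rm.example.com:8032")
    // Copy the real value from the cluster's yarn-site.xml so containers find the Hadoop jars
    .set("spark.hadoop.yarn.application.classpath",
         "$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*");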

2) spark-submit can be located on any machine, not necessarily on the cluster, as long as it knows how to connect to the cluster (you can even run the submission from Eclipse without installing anything other than the Spark-related project dependencies).
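
For illustration, here is a minimal, hedged sketch of such a remote submission started straight from an IDE; the class name, hostnames and HDFS path are made up, and on Spark 2.x you typically also have to point spark.yarn.jars (or spark.yarn.archive) at Spark jars the cluster can reach:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SubmitFromDevMachine {                           // hypothetical class name
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setMaster("yarn")                                // client mode when the driver runs locally
            .setAppName("submit-from-dev-machine")
            .set("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")      // placeholder
            .set("spark.hadoop.yarn.resourcemanager.address", "rm.example.com:8032")   // placeholder
            .set("spark.yarn.jars", "hdfs://namenode.example.com:8020/spark/jars/*");  // placeholder
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long count = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
            System.out.println("count = " + count);           // sanity check that the cluster ran the job
        }
    }
}

Run it with the spark-core and spark-yarn dependencies on the classpath; nothing else needs to be installed on the development machine.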

3) You should populate the configuration folders with:

  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • mapred-site.xml

Copying those files from the server is the easiest approach to start with. Afterwards you can remove any configuration that is not required by spark-submit or that may be security-sensitive.
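
If you want to sanity-check the copied folder before pointing HADOOP_CONF_DIR at it, the following hedged sketch (the directory path is a placeholder, and hadoop-common is assumed to be on the classpath) loads the files back and prints the keys listed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfDirCheck {                                    // hypothetical helper class
    public static void main(String[] args) {
        String confDir = "/home/me/hadoop-conf";               // placeholder: wherever you copied the files
        Configuration conf = new Configuration(false);         // start empty, load only the copied files
        conf.addResource(new Path(confDir, "core-site.xml"));
        conf.addResource(new Path(confDir, "hdfs-site.xml"));
        conf.addResource(new Path(confDir, "yarn-site.xml"));
        conf.addResource(new Path(confDir, "mapred-site.xml"));
        // The minimum keys mentioned in the answer for an unsecured cluster
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.nameservices = " + conf.get("dfs.nameservices"));
        System.out.println("yarn.resourcemanager.address = " + conf.get("yarn.resourcemanager.address"));
        System.out.println("yarn.application.classpath = " + conf.get("yarn.application.classpath"));
    }
}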
