Missing hive-site when using spark-submit YARN cluster mode


Problem Description


Using HDP 2.5.3, I've been trying to debug some YARN container classpath issues.

Since HDP includes both Spark 1.6 and 2.0.0, there have been some version conflicts.

Users I support are successfully able to use Spark2 with Hive queries in YARN client mode, but not in cluster mode: there they get errors about tables not being found, or similar errors, because the Metastore connection isn't established.

I am guessing that setting either --driver-class-path /etc/spark2/conf:/etc/hive/conf or passing --files /etc/spark2/conf/hive-site.xml after spark-submit would work, but why isn't hive-site.xml loaded already from the conf folder?
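
Concretely, the kind of invocation I have in mind is something like this (a sketch only; the application class and jar name are hypothetical placeholders, and the hive-site.xml path matches the conf directory shown further down):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /etc/spark2/conf/hive-site.xml \
  --class com.example.MyApp \
  my-app.jar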

According to the Hortonworks docs, hive-site should be placed in $SPARK_HOME/conf, and it is...

I see hdfs-site.xml and core-site.xml, and other files that are part of HADOOP_CONF_DIR. For example, this is from the YARN UI container info:

2232355    4 drwx------   2 yarn     hadoop       4096 Aug  2 21:59 ./__spark_conf__
2232379    4 -r-x------   1 yarn     hadoop       2358 Aug  2 21:59 ./__spark_conf__/topology_script.py
2232381    8 -r-x------   1 yarn     hadoop       4676 Aug  2 21:59 ./__spark_conf__/yarn-env.sh
2232392    4 -r-x------   1 yarn     hadoop        569 Aug  2 21:59 ./__spark_conf__/topology_mappings.data
2232398    4 -r-x------   1 yarn     hadoop        945 Aug  2 21:59 ./__spark_conf__/taskcontroller.cfg
2232356    4 -r-x------   1 yarn     hadoop        620 Aug  2 21:59 ./__spark_conf__/log4j.properties
2232382   12 -r-x------   1 yarn     hadoop       8960 Aug  2 21:59 ./__spark_conf__/hdfs-site.xml
2232371    4 -r-x------   1 yarn     hadoop       2090 Aug  2 21:59 ./__spark_conf__/hadoop-metrics2.properties
2232387    4 -r-x------   1 yarn     hadoop        662 Aug  2 21:59 ./__spark_conf__/mapred-env.sh
2232390    4 -r-x------   1 yarn     hadoop       1308 Aug  2 21:59 ./__spark_conf__/hadoop-policy.xml
2232399    4 -r-x------   1 yarn     hadoop       1480 Aug  2 21:59 ./__spark_conf__/__spark_conf__.properties
2232389    4 -r-x------   1 yarn     hadoop       1602 Aug  2 21:59 ./__spark_conf__/health_check
2232385    4 -r-x------   1 yarn     hadoop        913 Aug  2 21:59 ./__spark_conf__/rack_topology.data
2232377    4 -r-x------   1 yarn     hadoop       1484 Aug  2 21:59 ./__spark_conf__/ranger-hdfs-audit.xml
2232383    4 -r-x------   1 yarn     hadoop       1020 Aug  2 21:59 ./__spark_conf__/commons-logging.properties
2232357    8 -r-x------   1 yarn     hadoop       5721 Aug  2 21:59 ./__spark_conf__/hadoop-env.sh
2232391    4 -r-x------   1 yarn     hadoop        281 Aug  2 21:59 ./__spark_conf__/slaves
2232373    8 -r-x------   1 yarn     hadoop       6407 Aug  2 21:59 ./__spark_conf__/core-site.xml
2232393    4 -r-x------   1 yarn     hadoop        812 Aug  2 21:59 ./__spark_conf__/rack-topology.sh
2232394    4 -r-x------   1 yarn     hadoop       1044 Aug  2 21:59 ./__spark_conf__/ranger-hdfs-security.xml
2232395    8 -r-x------   1 yarn     hadoop       4956 Aug  2 21:59 ./__spark_conf__/metrics.properties
2232386    8 -r-x------   1 yarn     hadoop       4221 Aug  2 21:59 ./__spark_conf__/task-log4j.properties
2232380    4 -r-x------   1 yarn     hadoop         64 Aug  2 21:59 ./__spark_conf__/ranger-security.xml
2232372   20 -r-x------   1 yarn     hadoop      19975 Aug  2 21:59 ./__spark_conf__/yarn-site.xml
2232397    4 -r-x------   1 yarn     hadoop       1006 Aug  2 21:59 ./__spark_conf__/ranger-policymgr-ssl.xml
2232374    4 -r-x------   1 yarn     hadoop         29 Aug  2 21:59 ./__spark_conf__/yarn.exclude
2232384    4 -r-x------   1 yarn     hadoop       1606 Aug  2 21:59 ./__spark_conf__/container-executor.cfg
2232396    4 -r-x------   1 yarn     hadoop       1000 Aug  2 21:59 ./__spark_conf__/ssl-server.xml
2232375    4 -r-x------   1 yarn     hadoop          1 Aug  2 21:59 ./__spark_conf__/dfs.exclude
2232359    8 -r-x------   1 yarn     hadoop       7660 Aug  2 21:59 ./__spark_conf__/mapred-site.xml
2232378   16 -r-x------   1 yarn     hadoop      14474 Aug  2 21:59 ./__spark_conf__/capacity-scheduler.xml
2232376    4 -r-x------   1 yarn     hadoop        884 Aug  2 21:59 ./__spark_conf__/ssl-client.xml

As you can see, hive-site is not there, even though I definitely have conf/hive-site.xml for spark-submit to pick up:

[spark@asthad006 conf]$ pwd && ls -l
/usr/hdp/2.5.3.0-37/spark2/conf
total 32
-rw-r--r-- 1 spark spark   742 Mar  6 15:20 hive-site.xml
-rw-r--r-- 1 spark spark   620 Mar  6 15:20 log4j.properties
-rw-r--r-- 1 spark spark  4956 Mar  6 15:20 metrics.properties
-rw-r--r-- 1 spark spark   824 Aug  2 22:24 spark-defaults.conf
-rw-r--r-- 1 spark spark  1820 Aug  2 22:24 spark-env.sh
-rwxr-xr-x 1 spark spark   244 Mar  6 15:20 spark-thrift-fairscheduler.xml
-rw-r--r-- 1 hive  hadoop  918 Aug  2 22:24 spark-thrift-sparkconf.conf

So, I don't think I am supposed to place hive-site in HADOOP_CONF_DIR, since HIVE_CONF_DIR is separate, but my question is: how do we get Spark2 to pick up hive-site.xml without needing to manually pass it as a parameter at runtime?

EDIT: Naturally, since I'm on HDP, I am using Ambari. The previous cluster admin installed Spark2 clients on all of the machines, so all of the YARN NodeManagers that could be potential Spark drivers should have the same config files.

Solution

You can use the Spark property spark.yarn.dist.files and specify the path to hive-site.xml there.
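
For example, as a minimal sketch (assuming the conf path from the question), adding this line to spark-defaults.conf means the file no longer has to be passed manually at submit time:

spark.yarn.dist.files /etc/spark2/conf/hive-site.xml

Equivalently, it can be set for a single job with --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml on the spark-submit command line.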
