Custom Spark does not find Hive databases when running on YARN


Problem description

I am following https://georgheiler.com/2019/05/01/headless-spark-on-yarn/, i.e. the following:

# download a current headless version of spark
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
export SPARK_HOME=<<path/to>>/spark-2.4.3-bin-without-hadoop/
<<path/to>>/spark-2.4.3-bin-without-hadoop/bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>> --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.<<version>>' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.<<version>>'

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

However, running:

spark.sql("show databases").show

only returns:

+------------+
|databaseName|
+------------+
|     default|
+------------+

Now I am trying to pass the original HDP configuration (which is apparently not read in by my custom version of Spark), for example:

One:

--files /usr/hdp/current/spark2-client/conf/hive-site.xml

Two:

--conf spark.hive.metastore.uris='thrift://master001.my.corp.com:9083,thrift://master002.my.corp.com:9083,thrift://master003.my.corp.com:9083' --conf spark.hive.metastore.sasl.enabled='true' --conf hive.metastore.uris='thrift://master001.my.corp.com:9083,thrift://master002.my.corp.com:9083,thrift://master003.my.corp.com:9083' --conf hive.metastore.sasl.enabled='true'

Three:

--conf spark.yarn.dist.files='/usr/hdp/current/spark2-client/conf/hive-site.xml'

Four:

--conf spark.sql.warehouse.dir='/apps/hive/warehouse'

None of these helps to solve the issue. How can I get Spark to recognize the Hive databases?

Answer

The Hive jars need to be on Spark's classpath for Hive support to be enabled. If the Hive jars are not present on the classpath, the catalog implementation used is in-memory.
In spark-shell we can confirm this by executing

sc.getConf.get("spark.sql.catalogImplementation") 

This will give in-memory.
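
For comparison, a shell that did pick up the Hive jars would report hive for the same check, and the metastore databases become visible through the catalog API. A minimal sketch (not from the original answer):

sc.getConf.get("spark.sql.catalogImplementation")  // expected to be "hive" once the Hive jars are on the classpath
spark.catalog.listDatabases.show(false)            // lists the metastore databases instead of only "default"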

    def enableHiveSupport(): Builder = synchronized {
      if (hiveClassesArePresent) {
        config(CATALOG_IMPLEMENTATION.key, "hive")
      } else {
        throw new IllegalArgumentException(
          "Unable to instantiate SparkSession with Hive support because " +
            "Hive classes are not found.")
      }
    }

SparkSession.scala

  private[spark] def hiveClassesArePresent: Boolean = {
    try {
      Utils.classForName(HIVE_SESSION_STATE_BUILDER_CLASS_NAME)
      Utils.classForName("org.apache.hadoop.hive.conf.HiveConf")
      true
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => false
    }
  }

If the classes are not present, Hive support is not enabled. The checks above happen as part of Spark shell initialization.
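
The same check applies to an application that builds its own session: enableHiveSupport() only succeeds when the Hive classes can be loaded. A minimal sketch (the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Fails with "Unable to instantiate SparkSession with Hive support because
// Hive classes are not found." when the Hive jars are missing from the classpath.
val spark = SparkSession.builder()
  .appName("hive-support-check")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()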

In the code pasted as part of the question, SPARK_DIST_CLASSPATH is populated only with the Hadoop classpath; the paths to the Hive jars are missing.
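
The answer stops at the diagnosis. As one hedged illustration of a fix, the Hive client jars could be appended to SPARK_DIST_CLASSPATH before launching the headless build; the /usr/hdp/current/hive-client/lib path below is an assumption about the HDP layout and may need adjusting for your cluster:

# append the Hive jars to the classpath handed to the headless Spark build (library path is an assumption)
export SPARK_DIST_CLASSPATH="$(hadoop classpath):/usr/hdp/current/hive-client/lib/*"
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
<<path/to>>/spark-2.4.3-bin-without-hadoop/bin/spark-shell --master yarn --deploy-mode client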
