Spark 2: how does it work when SparkSession enableHiveSupport() is invoked
Question
My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.
I have Spark 2 running on a CDH 5.10 cluster. There is also Hive and a metastore.
I create a session in my Spark program as follows:
SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate()
Suppose I have the following HiveQL query:
spark.sql("SELECT someColumn FROM someTable")
I would like to know whether:
- under the hood, this query is translated into Hive MapReduce primitives, or
- the support for HiveQL is only at the syntax level, and Spark SQL is used under the hood.
I am doing some performance evaluation, and I don't know whether the time performance of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive.
Answer
Spark knows two catalogs, hive and in-memory. If you call enableHiveSupport(), then spark.sql.catalogImplementation is set to hive, otherwise to in-memory. So if you enable Hive support, spark.catalog.listTables().show() will show you all tables from the Hive metastore.
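You can verify this at runtime by reading the configuration back from the session. A minimal sketch, assuming Spark 2.x is on the classpath and a metastore is reachable; the app name is taken from the question:

```java
import org.apache.spark.sql.SparkSession;

public class CatalogCheck {
    public static void main(String[] args) {
        // Build a session with Hive support, as in the question.
        SparkSession spark = SparkSession.builder()
                .appName("MyApp")
                .enableHiveSupport()
                .getOrCreate();

        // With enableHiveSupport() this prints "hive";
        // without it, the default is "in-memory".
        System.out.println(spark.conf().get("spark.sql.catalogImplementation"));

        // When the catalog is "hive", this lists the tables
        // registered in the Hive metastore.
        spark.catalog().listTables().show();

        spark.stop();
    }
}
```

Running this requires a Spark installation (e.g. via spark-submit on the CDH cluster), so it is a sketch rather than a standalone program.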
But this does not mean Hive is used for the query*; it just means that Spark communicates with the Hive metastore. The execution engine is always Spark.
*There are actually some functions, such as percentile and percentile_approx, which are native Hive UDAFs.
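Even those Hive UDAFs are invoked through Spark's own execution engine; enabling Hive support merely makes them resolvable. A hedged fragment, reusing the session from the question and its placeholder column/table names:

```java
// Assumes the SparkSession "spark" from the question, built with
// enableHiveSupport(). percentile_approx here resolves to a Hive UDAF
// in Spark 2, but it still runs on Spark's engine, not on Hive.
spark.sql("SELECT percentile_approx(someColumn, 0.5) FROM someTable").show();
```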