Spark 2: how does it work when SparkSession enableHiveSupport() is invoked
Question
My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.
I have Spark 2 running on a CDH 5.10 cluster. There is also Hive and a metastore.
I create a session in my Spark program as follows:
SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate();
Suppose I have the following HiveQL query:
spark.sql("SELECT someColumn FROM someTable")
I would like to know whether:

- the query is translated into Hive MapReduce primitives under the hood, or
- the support for HiveQL is only syntactic, and Spark SQL is used under the hood.

I am doing some performance evaluation, and I don't know whether the time performance of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive.
Answer
Spark knows two catalogs, hive and in-memory. If you call enableHiveSupport(), then spark.sql.catalogImplementation is set to hive, otherwise to in-memory. So if you enable Hive support, spark.catalog.listTables().show() will show you all tables from the Hive metastore.
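This is easy to verify from the runtime configuration. Below is a minimal sketch; it uses local mode purely for illustration (on the CDH cluster you would submit the same code with spark-submit), and the class name CatalogCheck is just a placeholder:

```java
import org.apache.spark.sql.SparkSession;

public class CatalogCheck {
    public static void main(String[] args) {
        // Build a session with Hive support, as in the question.
        SparkSession spark = SparkSession.builder()
                .appName("CatalogCheck")
                .master("local[1]")   // local mode, for illustration only
                .enableHiveSupport()
                .getOrCreate();

        // Prints "hive" because enableHiveSupport() was called;
        // without that call, the value would be "in-memory".
        System.out.println(spark.conf().get("spark.sql.catalogImplementation"));

        // Lists the tables known to the catalog; against a real cluster
        // these come from the Hive metastore.
        spark.catalog().listTables().show();

        spark.stop();
    }
}
```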
But this does not mean that Hive is used for the query*; it just means that Spark communicates with the Hive metastore. The execution engine is always Spark.
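You can see this for yourself by printing the physical plan: for a HiveQL query it contains Spark operators, not MapReduce stages. A sketch, reusing the spark session and the placeholder table from the question:

```java
// The extended plan printed here is a Spark plan (e.g. FileScan,
// HashAggregate, Exchange); no Hive/MapReduce jobs appear. This is why
// query timings measured via spark.sql(...) should be attributed to Spark.
spark.sql("SELECT someColumn FROM someTable").explain(true);
```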
*There are actually some functions, such as percentile and percentile_approx, which are native Hive UDAFs.
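Even for those, Hive only supplies the function implementation; the aggregation is still executed by Spark. For example (again reusing the session and placeholder names from the question):

```java
// percentile_approx resolves to the Hive UDAF, but the scan and the
// aggregation run on Spark's execution engine, not on MapReduce.
spark.sql("SELECT percentile_approx(someColumn, 0.5) FROM someTable").show();
```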