Does Spark SQL use Hive Metastore?

Question

I am developing a Spark SQL application and I've got a few questions:

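  1. I read that Spark SQL uses a Hive metastore under the covers. Is this true? I'm talking about a pure Spark SQL application that does not explicitly connect to any Hive installation.
  2. I am starting a Spark SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark SQL is much faster than Hive, so I don't see any reason to use Hive. But am I correct?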

Answer

I read that Spark-SQL uses Hive metastore under the covers? Is this true? I'm talking about a pure Spark-SQL application that does not explicitly connect to any Hive installation.

Spark SQL does not use a Hive metastore under the covers. It defaults to the in-memory, non-Hive catalog, unless you're in spark-shell, which does the opposite.

The default external catalog implementation is controlled by the spark.sql.catalogImplementation internal property, which can take one of two possible values: hive and in-memory.

Use the SparkSession to find out which catalog is in use.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> spark.version
res0: String = 2.4.0

scala> :type spark.sharedState.externalCatalog
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener

scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@49d5b651

Please note that I used spark-shell, which does start a Hive-aware SparkSession, and so I had to start it with --conf spark.sql.catalogImplementation=in-memory to turn it off.
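For a standalone application (as opposed to spark-shell), a minimal sketch of reading the setting could look like the following. The object name CatalogCheck, the local[*] master and the application name are assumptions made for illustration, and the sketch assumes Spark 2.x or later on the classpath. The catalog is pinned explicitly here so the result does not depend on how the application is launched.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark itself.
object CatalogCheck {

  // Starts a local session with the in-memory catalog and returns the
  // value of spark.sql.catalogImplementation as seen by that session.
  def catalogImplementation(): String = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local run for the demo
      .appName("catalog-check")      // assumption: arbitrary app name
      // Pin the catalog explicitly. Calling .enableHiveSupport() instead
      // would select "hive", which requires Hive classes on the classpath.
      .config("spark.sql.catalogImplementation", "in-memory")
      .getOrCreate()
    try spark.conf.get("spark.sql.catalogImplementation")
    finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(catalogImplementation())
}
```

The property is a static configuration, so it has to be set before the session is created (on the builder or with --conf); it cannot be changed on a running session.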

I am starting a Spark-SQL application, and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark-SQL is much faster than Hive, so I don't see any reason to use Hive.

That's a very interesting question and it can have different answers (some of them primarily opinion-based, so we have to be extra careful and follow the Stack Overflow rules).

Is there any reason to use Hive?

No.

But... if you want to use a very recent feature of Spark 2.2, the cost-based optimizer, you may want to consider Hive after all. Running ANALYZE TABLE to collect cost statistics can be fairly expensive, so doing it once for tables that are used over and over again across different Spark application runs could give a performance boost.

Please note that Spark SQL without Hive can do it too, but with some limitations: the local default metastore is for single-user access only, and reusing the metadata across Spark applications submitted at the same time won't work.
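As a rough sketch of what collecting those statistics looks like, the following runs ANALYZE TABLE against a small table and reads the result back from DESCRIBE EXTENDED. The object name CboStatsSketch, the table name demo_ids and the temporary warehouse directory are assumptions for illustration. The sketch uses the in-memory catalog, so the statistics disappear when the application stops, which is the limitation described above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark itself.
object CboStatsSketch {

  // Creates a small table, analyzes it, and returns the "Statistics" value
  // reported by DESCRIBE EXTENDED (if the catalog recorded one).
  def analyzedStatistics(): Option[String] = {
    val spark = SparkSession.builder()
      .master("local[*]")                        // assumption: local run
      .appName("cbo-stats-sketch")
      .config("spark.sql.cbo.enabled", "true")   // the cost-based optimizer is off by default
      .config("spark.sql.catalogImplementation", "in-memory")
      // Fresh warehouse directory so repeated runs do not collide on disk.
      .config("spark.sql.warehouse.dir",
        Files.createTempDirectory("warehouse").toString)
      .getOrCreate()
    try {
      // demo_ids is a hypothetical table name.
      spark.range(1000).write.saveAsTable("demo_ids")

      // Scans the table once and stores the statistics in the catalog.
      spark.sql("ANALYZE TABLE demo_ids COMPUTE STATISTICS")

      // DESCRIBE EXTENDED returns (col_name, data_type, comment) rows.
      spark.sql("DESCRIBE EXTENDED demo_ids")
        .collect()
        .find(_.getString(0) == "Statistics")
        .map(_.getString(1))
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit =
    println(analyzedStatistics())
}
```

With a shared Hive metastore, the same ANALYZE TABLE statement would be run once and its result reused by later applications instead of being recomputed per run.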

I don't see any reason to use Hive.

I wrote a blog post, "Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive)", where I asked a similar question. To my surprise, it's only now (almost a year after I posted the blog post on Apr 9, 2016) that I think I may have understood why the concept of a Hive metastore is so important, especially in multi-user Spark notebook environments.

Hive itself is just a data warehouse on HDFS, so it is not of much use if you've got Spark SQL. There are still some concepts that Hive has done fairly well and that are of much use in Spark SQL (until it fully stands on its own legs with a Hive-like metastore).
