SparkSQL vs Hive on Spark - Difference and pros and cons?
Question
The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody shed more light on how exactly these two scenarios differ, and on the pros and cons of both approaches?
When SparkSQL uses Hive
SparkSQL can use the Hive Metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to better optimize the queries that it executes. Here Spark is the query processor.
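To make this first option concrete, here is a minimal PySpark sketch. It assumes Spark 2.0 or later, where `SparkSession` replaced the older `HiveContext`, and a `hive-site.xml` on Spark's classpath that points at the metastore. The helper names and the table name passed to `top_rows` are hypothetical, so treat this as an illustration rather than the setup used in the tests mentioned in this answer.

```python
def hive_backed_session(app_name="sparksql-with-hive"):
    """Return a SparkSession whose catalog is backed by the Hive Metastore."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # table metadata is looked up in the Hive Metastore
        .getOrCreate()
    )


def top_rows(spark, table, limit=10):
    """Run a simple query; Spark plans and executes it, Hive only supplies metadata."""
    # `table` is a hypothetical table name already registered in the metastore.
    return spark.sql(f"SELECT * FROM {table} LIMIT {int(limit)}")
```

Typical use would be `top_rows(hive_backed_session(), "some_table").show()`, where `some_table` stands in for a table that already exists in your metastore.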
When Hive uses Spark (see the JIRA entry HIVE-7292)
Here the data is accessed via Spark, and Hive is the query processor, so we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive and is still "in progress" as of Feb 2, 2016.
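For this second option the switch happens on the Hive side, through the `hive.execution.engine` property (its values are `mr`, `tez`, and `spark`). The sketch below wraps the `hive -e` command line in Python. It assumes a Hive build with Spark support and a `hive` binary on the `PATH`; the helper names are hypothetical.

```python
import subprocess


def hive_on_spark_command(query):
    """Build a `hive -e` invocation that runs `query` with Spark as the engine."""
    # The SET statement applies only to this Hive session.
    script = f"SET hive.execution.engine=spark; {query}"
    return ["hive", "-e", script]


def run_hive_on_spark(query):
    """Execute the query through the Hive CLI and return its standard output."""
    # Fails with CalledProcessError if Hive exits with a non-zero status.
    result = subprocess.run(
        hive_on_spark_command(query),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

The query itself stays plain HiveQL. Hive still parses and plans it, and only the execution is handed to Spark.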
There is a third option to process data with SparkSQL
Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
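A rough sketch of this third option in PySpark, again assuming Spark 2.0 or later. Hive support is simply not enabled, so the data has to be registered by hand before it can be queried. Parquet is an assumption made for the example (the file format used in the tests is not stated here), and the path, view name, and helper names are hypothetical.

```python
def plain_session(app_name="sparksql-without-hive"):
    """Return a SparkSession built without enableHiveSupport()."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def register_parquet_view(spark, path, view_name):
    """Load files from `path` and expose them to SQL under `view_name`."""
    # With no metastore, the schema comes from the Parquet files themselves.
    df = spark.read.parquet(path)
    # The view lives only for the lifetime of this Spark session.
    df.createOrReplaceTempView(view_name)
    return df
```

After `register_parquet_view(spark, "/some/path", "events")`, a call such as `spark.sql("SELECT COUNT(*) FROM events")` works as usual, but nothing about the table is shared with Hive.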