How does Spark's HiveContext work internally?


Question

I am new to Spark. I found that with HiveContext we can connect to Hive and run HiveQL queries. I ran one and it worked.

My doubt is whether Spark does this through Spark jobs. That is, does it use HiveContext only to access the corresponding Hive table files from HDFS,

or

does it internally call Hive to execute the query?

Answer

No, Spark does not call Hive to execute the query. Spark only reads the metadata from Hive and executes the query within the Spark engine. Spark has its own SQL execution engine, which includes components such as Catalyst and Tungsten to optimize queries and return results faster. It uses the metadata from Hive and Spark's own execution engine to run the queries.

One of Hive's greatest advantages is its metastore. It acts as a single metadata store for many components in the Hadoop ecosystem.

Coming to your question: when you use HiveContext, it gets access to the metastore database and all your Hive metadata, which describes what type of data you have, where the data lives, the serialization and deserialization (SerDe) details, compression codecs, columns, data types, and virtually every detail about the table and its data. That is enough for Spark to understand the data.

Overall, Spark only needs the metastore, which gives it complete details of the underlying data; once it has the metadata, it executes the queries you ask for on its own execution engine. Hive is slower than Spark because it uses MapReduce, so there is no point in going back to Hive and asking it to run the query.

Let me know if this answers your question.

