How can I make pyspark and SparkSQL execute Hive on Spark?


Problem description


I've installed and set up Spark on YARN, and integrated Spark with Hive tables. Using spark-shell / pyspark, I also followed the simple tutorial and was able to create a Hive table, load data, and then select from it properly.


Then I moved to the next step, setting up Hive on Spark. Using hive / beeline, I was also able to create a Hive table, load data, and then select from it properly. Hive is executed on YARN/Spark correctly. How do I know it works? The hive shell displays the following:

hive> select sum(col1) from test_table;
....
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      3          3        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 55.26 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 55.26 second(s)
OK
6
Time taken: 99.165 seconds, Fetched: 1 row(s)


The ResourceManager UI also displays the RUNNING application as Hive on Spark (sessionId = ....), and I am able to visit the ApplicationMaster to look at the query details as well.


The current step, which I cannot achieve yet, is integrating pyspark / SparkSQL with Hive on Spark.

  1. Edit $SPARK_HOME/conf/hive-site.xml to set hive.execution.engine = spark.

    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
        <description>
            Expects one of [mr, tez, spark].
        </description>
    </property>

  2. Log in to pyspark using bin/pyspark and check hive.execution.engine:

>>> spark.sql("set spark.master").show()
+------------+-----+
|         key|value|
+------------+-----+
|spark.master| yarn|
+------------+-----+

>>> spark.sql("set spark.submit.deployMode").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|spark.submit.depl...|client|
+--------------------+------+

>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----------+
|                 key|      value|
+--------------------+-----------+
|hive.execution.en...|<undefined>|
+--------------------+-----------+

  3. Since there is no value for hive.execution.engine (quite surprising, since I've set it in hive-site.xml!), I decided to set it manually as follows:

>>> spark.sql("set hive.execution.engine=spark")
>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|hive.execution.en...|spark|
+--------------------+-----+

  4. Select data from the Hive table by using SparkSQL:

>>> spark.sql("select sum(col1) from test_table").show()
+---------+
|sum(col1)|
+---------+
|        6|
+---------+

  5. Even though the result is shown, there is no application displayed in the ResourceManager. I understand that SparkSQL does not use Hive on Spark. I have no clue about this.
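For reference, a quick way to check which application actually ran that query (a sketch using only standard pyspark API, run inside the same bin/pyspark shell where spark is predefined):

>>> # The query runs inside the pyspark shell's own Spark application (named
>>> # "PySparkShell" by default); that is the entry to look for in the YARN
>>> # ResourceManager UI.
>>> spark.sparkContext.applicationId
>>> spark.sparkContext.appName
>>> # Listing databases confirms that the session can reach the Hive metastore.
>>> spark.catalog.listDatabases()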

The questions are:

  1. How can I make pyspark / SparkSQL use Hive on Spark?
  2. Is doing this appropriate for speeding things up and moving away from the mr execution engine?
  3. Am I mixing the wrong ingredients, or is this just not possible?

Recommended answer

"Hive on Spark" 的缩写,"HiveServer2默认使用Spark执行引擎" .

  • What are the clients of the HS2 service? Apps that consider Hive as a regular database, connecting via JDBC (Java/Scala apps such as beeline), ODBC (R scripts, Windows apps) or DBI (Python apps & scripts), and submitting SQL queries -- a Python sketch of that pattern follows this list.
  • Does that apply to Spark jobs? No...! Spark wants raw access to the data files. In essence, Spark is its own database engine; there is even the Spark ThriftServer that can be used as a (crude) replacement for HS2.
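To make the "HS2 client" pattern concrete, here is a minimal Python sketch; it assumes the PyHive DB-API package and a HiveServer2 host/port, none of which come from the question:

# pip install pyhive -- one DB-API option among many (JDBC/ODBC clients work the same way)
from pyhive import hive

# Connect to HiveServer2; host, port and database are placeholders for your cluster.
conn = hive.connect(host="hs2-host.example.com", port=10000, database="default")
cur = conn.cursor()

# This statement is executed by HiveServer2 itself, so it honours
# hive.execution.engine=spark and shows up in YARN as "Hive on Spark (sessionId = ...)".
cur.execute("SELECT sum(col1) FROM test_table")
print(cur.fetchall())

cur.close()
conn.close()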


When Spark is built to interact with Hive V1 or Hive V2, it only interacts with the MetaStore service -- i.e. the metadata catalog that makes it possible for multiple systems (HiveServer2 / Presto / Impala / Spark jobs / Spark ThriftServer / etc) to share the same definition for "databases" and "tables", including the location of the data files (i.e. the HDFS directories / S3 pseudo-directories / etc)
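As an illustration, a minimal pyspark sketch of that metastore-only interaction (the application name and metastore URI are placeholders, not values taken from the question):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-only-demo")                                 # illustrative name
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder URI
         .enableHiveSupport()                                            # use the Hive metastore as Spark's catalog
         .getOrCreate())

# The table definition (schema, file format, HDFS/S3 location) is read from the
# shared metastore...
spark.sql("DESCRIBE FORMATTED test_table").show(truncate=False)

# ...but scanning the data files and computing the aggregate is done by Spark's
# own engine; HiveServer2 and hive.execution.engine are never involved.
spark.sql("SELECT sum(col1) FROM test_table").show()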


But each system has its own libraries to read and write into the "tables" -- HiveServer2 uses YARN jobs (with a choice of execution engines such as MapReduce, TEZ, Spark); Impala and Presto have their own execution engines running outside of YARN; Spark has its own execution engine running inside or outside of YARN.
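For example, reading and writing a shared table through Spark's own engine looks like this (reusing the spark session from the sketch above; test_table comes from the question, test_table_copy is made up):

# The read resolves the table through the shared metastore, then Spark executors
# scan the underlying files directly.
df = spark.table("test_table")
df.groupBy().sum("col1").show()

# The write also goes through Spark's own writers; only the new table's metadata
# is registered in the metastore, so Hive / Presto / Impala can see it afterwards.
df.write.mode("overwrite").saveAsTable("test_table_copy")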


And unfortunately these systems do not coordinate their read/write operations, which can be a real mess (i.e. a Hive SELECT query may crash because a Spark job has just deleted a file while rebuilding a partition, and vice-versa), although the Metastore provides an API to manage read/write locks in ZooKeeper. Only HS2 supports that API, apparently, and it's not even active by default.

PS: Hive LLAP is yet another system, one that uses YARN with TEZ (no other choice) but adds an extra persistence layer and an in-memory grid for caching -- i.e. not your regular HiveServer2, but an evolution that HortonWorks pushed as a competitor to Impala and Presto.




When Spark is built to interact with Hive V3 "HortonWorks-style", there is a catch:

  • By default, HiveServer2 manages "ACID tables" in a specific data format (a variant of ORC) that Spark does not support
  • By default, the Metastore prevents Spark from knowing about any HiveServer2 table, by using different namespaces for HS2 and for Spark -- effectively defeating the purpose of having a single shared catalog...!
  • Hence Horton ships a dedicated "connector" for Spark to access Hive tables via HS2 -- negating the purpose of using the Spark execution engine...! (a sketch of that connector follows this list)
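For completeness, a sketch of that connector (the Hive Warehouse Connector shipped with HDP 3.x). The package name, API and config key below follow Hortonworks' documentation and may differ in your distribution, so treat them as assumptions rather than a definitive recipe:

# Requires the HWC jar/zip on the Spark classpath and
# spark.sql.hive.hiveserver2.jdbc.url pointing at your HiveServer2 instance.
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()   # 'spark' is the existing SparkSession

# The query is pushed through HiveServer2/LLAP instead of Spark's own readers,
# which is how Hive 3 ACID tables become readable from Spark.
hive.executeQuery("SELECT sum(col1) FROM test_table").show()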


Since Horton has been absorbed by Cloudera, the future of Spark integration with the Metastore is not clear. Most of the good parts from the Horton distro are replacing the lame (or missing) parts from Cloudera; but that specific development was not obviously good.

