How can I make pyspark and SparkSQL execute Hive on Spark?


Problem description


I've installed and set up Spark on YARN, and integrated Spark with Hive tables. Using spark-shell / pyspark, I also followed the simple tutorial and was able to create a Hive table, load data, and then select from it properly.


Then I moved to the next step, setting up Hive on Spark. Using hive / beeline, I was also able to create a Hive table, load data, and then select from it properly. Hive is executed on YARN/Spark correctly. How do I know it works? The hive shell displays the following:

hive> select sum(col1) from test_table;
....
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      3          3        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 55.26 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 55.26 second(s)
OK
6
Time taken: 99.165 seconds, Fetched: 1 row(s)


The ResourceManager UI also displays the RUNNING application as Hive on Spark (sessionId = ....), and I am able to visit the ApplicationMaster to look at the query details as well.


The current step, which I cannot achieve yet, is integrating pyspark / SparkSQL with Hive on Spark.

  1. Edit $SPARK_HOME/conf/hive-site.xml to set hive.execution.engine = spark.

    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
        <description>
            Expects one of [mr, tez, spark].
        </description>
    </property>

  2. Log in to pyspark using bin/pyspark and check hive.execution.engine:

>>> spark.sql("set spark.master").show()
+------------+-----+
|         key|value|
+------------+-----+
|spark.master| yarn|
+------------+-----+

>>> spark.sql("set spark.submit.deployMode").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|spark.submit.depl...|client|
+--------------------+------+

>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----------+
|                 key|      value|
+--------------------+-----------+
|hive.execution.en...|<undefined>|
+--------------------+-----------+

  3. Since there is no value for hive.execution.engine (quite surprising, since I've set it in hive-site.xml!), I decided to set it manually as follows:

>>> spark.sql("set hive.execution.engine=spark")
>>> spark.sql("set hive.execution.engine").show()
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|hive.execution.en...|spark|
+--------------------+-----+

  4. Select data from the Hive table by using SparkSQL:

>>> spark.sql("select sum(col1) from test_table").show()
+---------+
|sum(col1)|
+---------+
|        6|
+---------+

  5. Even though the result is shown, there is no application displayed in the ResourceManager. I understand that SparkSQL does not use Hive on Spark. I have no clue about this.
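For reference, a quick way to check which application actually ran that query (a sketch using only standard pyspark API, run inside the same bin/pyspark shell where spark is predefined):

>>> # The query runs inside the pyspark shell's own Spark application (named
>>> # "PySparkShell" by default); that is the entry to look for in the YARN
>>> # ResourceManager UI.
>>> spark.sparkContext.applicationId
>>> spark.sparkContext.appName
>>> # Listing databases confirms that the session can reach the Hive metastore.
>>> spark.catalog.listDatabases()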

The questions are:

  1. How can I make pyspark / SparkSQL use Hive on Spark?
  2. Is doing this appropriate for speeding things up and moving away from the mr execution engine?
  3. Am I mixing the wrong ingredients, or is this just not possible?

Recommended answer

"Hive on Spark" 的缩写,"HiveServer2默认使用Spark执行引擎" .

  • What are the clients of the HS2 service? Apps that consider Hive as a regular database, connecting via JDBC (Java/Scala apps such as beeline), ODBC (R scripts, Windows apps) or DBI (Python apps & scripts), and submitting SQL queries -- a Python sketch of that pattern follows this list.
  • Does that apply to Spark jobs? No...! Spark wants raw access to the data files. In essence, Spark is its own database engine; there is even the Spark ThriftServer that can be used as a (crude) replacement for HS2.
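To make the "HS2 client" pattern concrete, here is a minimal Python sketch; it assumes the PyHive DB-API package and a HiveServer2 host/port, none of which come from the question:

# pip install pyhive -- one DB-API option among many (JDBC/ODBC clients work the same way)
from pyhive import hive

# Connect to HiveServer2; host, port and database are placeholders for your cluster.
conn = hive.connect(host="hs2-host.example.com", port=10000, database="default")
cur = conn.cursor()

# This statement is executed by HiveServer2 itself, so it honours
# hive.execution.engine=spark and shows up in YARN as "Hive on Spark (sessionId = ...)".
cur.execute("SELECT sum(col1) FROM test_table")
print(cur.fetchall())

cur.close()
conn.close()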


When Spark is built to interact with Hive V1 or Hive V2, it only interacts with the MetaStore service -- i.e. the metadata catalog that makes it possible for multiple systems (HiveServer2 / Presto / Impala / Spark jobs / Spark ThriftServer / etc) to share the same definition for "databases" and "tables", including the location of the data files (i.e. the HDFS directories / S3 pseudo-directories / etc)
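As an illustration, a minimal pyspark sketch of that metastore-only interaction (the application name and metastore URI are placeholders, not values taken from the question):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-only-demo")                                 # illustrative name
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder URI
         .enableHiveSupport()                                            # use the Hive metastore as Spark's catalog
         .getOrCreate())

# The table definition (schema, file format, HDFS/S3 location) is read from the
# shared metastore...
spark.sql("DESCRIBE FORMATTED test_table").show(truncate=False)

# ...but scanning the data files and computing the aggregate is done by Spark's
# own engine; HiveServer2 and hive.execution.engine are never involved.
spark.sql("SELECT sum(col1) FROM test_table").show()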


But each system has its own libraries to read and write into the "tables" -- HiveServer2 uses YARN jobs (with a choice of execution engines such as MapReduce, TEZ, Spark); Impala and Presto have their own execution engines running outside of YARN; Spark has its own execution engine running inside or outside of YARN.
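For example, reading and writing a shared table through Spark's own engine looks like this (reusing the spark session from the sketch above; test_table comes from the question, test_table_copy is made up):

# The read resolves the table through the shared metastore, then Spark executors
# scan the underlying files directly.
df = spark.table("test_table")
df.groupBy().sum("col1").show()

# The write also goes through Spark's own writers; only the new table's metadata
# is registered in the metastore, so Hive / Presto / Impala can see it afterwards.
df.write.mode("overwrite").saveAsTable("test_table_copy")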


And unfortunately these systems do not coordinate their read/write operations, which can be a real mess (i.e. a Hive SELECT query may crash because a Spark job has just deleted a file while rebuilding a partition, and vice-versa), although the Metastore provides an API to manage read/write locks in ZooKeeper. Only HS2 supports that API, apparently, and it's not even active by default.

PS: Hive LLAP is yet another system, one that uses YARN with TEZ (no other choice) but adds an extra persistence layer and an in-memory grid for caching -- i.e. not your regular HiveServer2, but an evolution that HortonWorks pushed as a competitor to Impala and Presto.




When Spark is built to interact with Hive V3 "HortonWorks-style", there is a catch:

  • By default, HiveServer2 manages "ACID tables" in a specific data format (a variant of ORC) that Spark does not support
  • By default, the Metastore prevents Spark from knowing about any HiveServer2 table, by using different namespaces for HS2 and for Spark -- effectively defeating the purpose of having a single shared catalog...!
  • Hence Horton ships a dedicated "connector" for Spark to access Hive tables via HS2 -- negating the purpose of using the Spark execution engine...! (a sketch of that connector follows this list)
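For completeness, a sketch of that connector (the Hive Warehouse Connector shipped with HDP 3.x). The package name, API and config key below follow Hortonworks' documentation and may differ in your distribution, so treat them as assumptions rather than a definitive recipe:

# Requires the HWC jar/zip on the Spark classpath and
# spark.sql.hive.hiveserver2.jdbc.url pointing at your HiveServer2 instance.
from pyspark_llap import HiveWarehouseSession

hive = HiveWarehouseSession.session(spark).build()   # 'spark' is the existing SparkSession

# The query is pushed through HiveServer2/LLAP instead of Spark's own readers,
# which is how Hive 3 ACID tables become readable from Spark.
hive.executeQuery("SELECT sum(col1) FROM test_table").show()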


Since Horton has been absorbed by Cloudera, the future of Spark integration with the Metastore is not clear. Most of the good parts from the Horton distro are replacing the lame (or missing) parts from Cloudera; but that specific development was not obviously good.

