Multiple Spark applications with HiveContext


Problem description

Having two separate pyspark applications that instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))

The other application terminates successfully.

I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.

This is enough to reproduce the issue:

import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)

The sleep is just to keep the script running while I start the other process.

If I have two instances of this script running, the above error shows up when reading the Parquet file. When I replace HiveContext with SQLContext, everything is fine.

Does anyone know why that is?

Recommended answer

By default Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. The configuration details depend on the backend and the mode (local/remote), but generally speaking you'll need:

  • a running RDBMS server
  • a metastore database created using provided scripts
  • a proper Hive configuration
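As a rough sketch of the last step (the property names are standard Hive settings, but the host, database name, and credentials below are placeholders for illustration), a hive-site.xml for a MySQL-backed local-mode metastore might look like:

```xml
<configuration>
  <!-- JDBC connection to the standalone metastore database (placeholder host/db) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

With this file on the classpath of each Spark application, all of them share the same RDBMS-backed metastore instead of competing for an embedded Derby instance.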

Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.

In theory it should also be possible to create separate Derby metastores with the proper configuration (see the Hive Admin Manual, Local/Embedded Metastore Database) or to use Derby in Server Mode.

For development you can start the applications from different working directories. This creates a separate metastore_db for each application and avoids the issue of multiple active users. Providing a separate Hive configuration per application should work as well, but it is less convenient in development:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
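Along the same lines, a per-application Hive configuration could, as a sketch, point each application at its own Derby database via the JDBC connection URL (the database path below is a placeholder; each application would get a hive-site.xml with a distinct path):

```xml
<configuration>
  <!-- Give this application its own embedded Derby metastore (placeholder path) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/app1_metastore_db;create=true</value>
  </property>
</configuration>
```

This keeps the applications in one working directory while still giving each its own single-user Derby database, at the cost of maintaining one configuration file per application.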

