Multiple Spark applications with HiveContext
Question
Having two separate pyspark applications that instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is sufficient to reproduce the issue:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext  # HiveContext was missing from the original import

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows when reading the parquet file. When I replace HiveContext with SQLContext, everything's fine.
Does anyone know why that is?
Recommended answer
By default, Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At this moment Hive supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. Details of the configuration depend on the backend and the mode (local / remote), but generally speaking you'll need:
- a running RDBMS server
- a metastore database created using provided scripts
- a proper Hive configuration
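To make those steps concrete, a minimal hive-site.xml for a MySQL-backed metastore might look like the sketch below. The host, port, database name, and credentials are placeholders, not values from the original answer; the property names themselves are standard Hive metastore settings.

```xml
<configuration>
  <!-- JDBC connection to the shared metastore database (placeholder values) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

With a shared RDBMS-backed metastore like this, multiple HiveContext applications can run concurrently, since the single-active-user limitation of embedded Derby no longer applies.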
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically, it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
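As a sketch of the Derby-in-Server-Mode option: the metastore connection properties in hive-site.xml would point at a running Derby network server rather than the embedded database. The host and port below are assumptions (1527 is Derby's default network port), not values from the original answer.

```xml
<!-- Point the metastore at a Derby network server instead of embedded Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
</property>
```

Note that this uses the Derby client driver (`ClientDriver`) rather than the embedded driver, which is what allows several applications to share the one metastore.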