Multiple Spark applications with HiveContext
Question
Having two separate pyspark applications that instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is sufficient to reproduce the issue:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext  # HiveContext was missing from the original import

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows when reading the parquet file. When I replace HiveContext with SQLContext, everything's fine.
Does anyone know why that is?
Recommended answer
By default, Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At this moment Hive supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. Details of the configuration depend on the backend and the mode (local / remote), but generally speaking you'll need:
- a running RDBMS server
- a metastore database created using provided scripts
- a proper Hive configuration
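To make those steps concrete, a minimal hive-site.xml for a MySQL-backed metastore might look like the sketch below. The host, port, database name, and credentials are placeholders, not values from the original answer; the property names themselves are standard Hive metastore settings.

```xml
<configuration>
  <!-- JDBC connection to the shared metastore database (placeholder values) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```

With a shared RDBMS-backed metastore like this, multiple HiveContext applications can run concurrently, since the single-active-user limitation of embedded Derby no longer applies.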
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically, it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
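As a sketch of the Derby-in-Server-Mode option: the metastore connection properties in hive-site.xml would point at a running Derby network server rather than the embedded database. The host and port below are assumptions (1527 is Derby's default network port), not values from the original answer.

```xml
<!-- Point the metastore at a Derby network server instead of embedded Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
</property>
```

Note that this uses the Derby client driver (`ClientDriver`) rather than the embedded driver, which is what allows several applications to share the one metastore.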