Multiple Spark applications with HiveContext

Problem description

Running two separate pyspark applications that each instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))

The other application terminates successfully.

I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.

This is enough to reproduce:

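import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext  # the script uses HiveContext, so import it

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)  # the error shows up at this read
time.sleep(60)  # keep this instance alive while the other one starts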

The sleep is just to keep the script running while I start the other process.

If I have two instances of this script running, the above error shows up when reading the Parquet file. When I replace HiveContext with SQLContext, everything is fine.

Does anyone know why this is?

Recommended answer

By default, Hive(Context) uses an embedded Derby database as its metastore. It is intended mostly for testing and supports only one active user at a time, which is why the second application fails. If you want to support multiple running applications, you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL and Oracle as backends. The details of the configuration depend on the backend and the option chosen (local / remote), but generally speaking you'll need:

  • a running RDBMS server
  • a metastore database created using provided scripts
  • a proper Hive configuration
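
As a rough illustration of the last item, the sketch below renders a hive-site.xml that points the metastore at an external database. The property names are the standard Hive JDBC connection settings; the PostgreSQL host, database name, user and password are placeholders invented for this example, and the helper build_hive_site is hypothetical, not part of Spark or Hive.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Render a dict of Hive properties as hive-site.xml text."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values (assumptions): adjust host, port, database and credentials.
metastore_properties = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://metastore-host:5432/metastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

hive_site_xml = build_hive_site(metastore_properties)
print(hive_site_xml)
```

The resulting file would typically go into Spark's conf directory, with the matching JDBC driver jar available on the driver classpath. The exact steps depend on your distribution, so treat this as a starting point rather than a complete setup.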

Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore (https://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_hive_metastore_configure.html).

Theoretically it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database: https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-Local/EmbeddedMetastoreDatabase%28Derby%29) or to use Derby in Server Mode.
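
One way to get separate Derby metastores without touching the Hive configuration is sketched below. It assumes the default embedded setup, in which Derby creates its metastore_db directory relative to the working directory of the driver process, so launching each application from its own directory keeps the two databases apart. The helper derby_launch_plan and the script name app.py are hypothetical, and the applications will not see each other's table metadata.

```python
import os
import tempfile


def derby_launch_plan(script_path, base_dir=None):
    """Return (command, working_dir) for one isolated application run."""
    # Each run gets a fresh directory, so each gets its own metastore_db.
    work_dir = tempfile.mkdtemp(prefix="spark-derby-", dir=base_dir)
    command = ["spark-submit", os.path.abspath(script_path)]
    return command, work_dir


plan_a = derby_launch_plan("app.py")
plan_b = derby_launch_plan("app.py")

# To actually start a run (requires spark-submit on the PATH):
#   import subprocess
#   command, work_dir = plan_a
#   subprocess.Popen(command, cwd=work_dir)
```

This only sidesteps the single-user limitation; a shared standalone metastore is still the better fit when the applications need to see the same tables.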
