Can't seem to build hive for spark


Problem description


I have been trying to run this code in pyspark.

from pyspark.sql import HiveContext  # needs a Spark assembly built with Hive support

sqlContext = HiveContext(sc)
datumDF = sqlContext.createDataFrame(datumX, schema)


But have been receiving this warning:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o44))


I log in to AWS and spin up clusters with this code: /User/Downloads/spark-1.5.2-bin-hadoop2.6/ec2/spark-ec2 -k name -i /User/Desktop/pemfile.pem login clustername


However, all the docs I've found involve these commands, which exist in the folder /users/downloads/spark-1.5.2/. I've run them anyway, and then tried logging in using the ec2 command in that folder. Still, I just got the same error.


I run export SPARK_HIVE=TRUE before running these commands on my local machine, but I've seen messages saying it's deprecated and will be ignored anyway.


Build hive with maven:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 \
    -Phive -Phive-thriftserver -DskipTests clean package


Build hive with sbt:

 build/sbt -Pyarn -Phadoop-2.3 assembly

Another one I found:

./sbt/sbt -Phive assembly

I also took the hive-site.xml file and put it in both the /Users/Downloads/spark-1.5.2-bin-hadoop2.6/conf folder and /Users/Downloads/spark-1.5.2/conf.

Still no luck.


I can't seem to run the hive commands no matter what I build it with or how I log in. Is there anything obvious I'm missing?

Recommended answer


I too had the same error when using a HiveContext on an EC2 cluster built with the ec2 scripts that come with the Spark package (v1.5.2 in my case). Through much trial and error, I found that building an EC2 cluster with the following options got the right version of Hadoop with Hive properly built, so that I could use a HiveContext in my PySpark jobs:

spark-ec2 -k <your key pair name> -i /path/to/identity-file.pem -r us-west-2 -s 2 --instance-type m3.medium --spark-version 1.5.2 --hadoop-major-version yarn  launch <your cluster name>


The key parameters here are that you set --spark-version to 1.5.2 and --hadoop-major-version to yarn - even though you aren't going to use Yarn to submit jobs - as it forces the hadoop build to be 2.4. Of course, adjust the other parameters as appropriate for your desired cluster.
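Once a cluster launched with those options is up, a quick way to confirm that Hive support is in place is to exercise a HiveContext from the cluster, e.g. via spark-submit - the sample rows below are made up for illustration and stand in for the datumX/schema from the question:

from pyspark import SparkContext
from pyspark.sql import HiveContext, Row

sc = SparkContext(appName="hive-context-check")  # hypothetical app name
sqlContext = HiveContext(sc)

# Made-up sample data standing in for datumX/schema from the question
rows = sc.parallelize([Row(name="alice", age=30), Row(name="bob", age=25)])
df = sqlContext.createDataFrame(rows)
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 26").show()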

