Can't seem to build Hive for Spark


Question

I have been trying to run this code in PySpark:

from pyspark.sql import HiveContext  # needed for HiveContext in PySpark

sqlContext = HiveContext(sc)
datumDF = sqlContext.createDataFrame(datumX, schema)

But I keep getting this error:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o44))

I log in to AWS and spin up clusters with this command: /User/Downloads/spark-1.5.2-bin-hadoop2.6/ec2/spark-ec2 -k name -i /User/Desktop/pemfile.pem login clustername

However, all the docs I've found involve these commands, which live in the folder /users/downloads/spark-1.5.2/. I've run them anyway, and tried logging in with the ec2 command in that folder after I did. Still got the same error.

I run export SPARK_HIVE=TRUE before running these commands on my local machine, but I've seen messages saying it's deprecated and will be ignored anyway.

Build Hive with Maven:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 \
    -Phive -Phive-thriftserver -DskipTests clean package

Build Hive with sbt:

 build/sbt -Pyarn -Phadoop-2.3 assembly

I also found another one:

./sbt/sbt -Phive assembly

I also took the hive-site.xml file and put it in both the /Users/Downloads/spark-1.5.2-bin-hadoop2.6/conf folder and the /Users/Downloads/spark-1.5.2/conf folder.
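Once a Hive-enabled build is actually running, one way to confirm that the hive-site.xml in the conf folder is being picked up is to read a property back from the HiveContext (a sketch only; hive.metastore.warehouse.dir is just an example key that is commonly set in that file):

    from pyspark.sql import HiveContext

    # Sketch: assumes a Hive-enabled Spark build and the pyspark shell's `sc`.
    # If conf/hive-site.xml is being read, values set there should show up here.
    sqlContext = HiveContext(sc)
    print(sqlContext.getConf("hive.metastore.warehouse.dir", "<not set>"))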

Still no luck.

I can't seem to run the Hive commands no matter what I build it with or how I log in. Is there anything obvious I'm missing?

Answer

I too had the same error when using a HiveContext on an EC2 cluster built with the ec2 scripts that come with the Spark package (v1.5.2 in my case). Through much trial and error, I found that building an EC2 cluster with the following options got the right version of Hadoop, with Hive properly built, so that I could use a HiveContext in my PySpark jobs:

spark-ec2 -k <your key pair name> -i /path/to/identity-file.pem -r us-west-2 -s 2 --instance-type m3.medium --spark-version 1.5.2 --hadoop-major-version yarn  launch <your cluster name>

The key parameters here are that you set --spark-version to 1.5.2 and --hadoop-major-version to yarn - even if you aren't going to use Yarn to submit jobs - because it forces the Hadoop build to be 2.4. Of course, adjust the other parameters as appropriate for your desired cluster.
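After launching a cluster with those options, the snippet from the question should work. Here is a hedged, self-contained variant for checking it end to end (the sample rows and schema are made up purely for illustration):

    from pyspark.sql import HiveContext
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Illustrative data and schema (placeholders, not from the original question).
    datumX = [("alice", 1), ("bob", 2)]
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("value", IntegerType(), True),
    ])

    sqlContext = HiveContext(sc)              # should no longer raise the py4j error
    datumDF = sqlContext.createDataFrame(datumX, schema)
    datumDF.registerTempTable("datum")
    sqlContext.sql("SELECT name, value FROM datum").show()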
