When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql


Problem Description

I am using Zeppelin 0.5.5. I found this code/sample here for Python because I couldn't get my own to work with %pyspark: http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I have a feeling his %pyspark example worked because, if you use the original %spark Zeppelin tutorial, the "bank" table is already created.

This code is in a notebook.

%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# sqlContext = SQLContext(sc) # Removed with latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(), False),
                         StructField("job", StringType(), False),
                         StructField("marital", StringType(), False),
                         StructField("education", StringType(), False),
                         StructField("balance", IntegerType(), False)])

# Skip the quoted header row and keep the age/job/marital/education/balance columns
bank = bankText.map(lambda s: s.split(";")) \
    .filter(lambda s: s[0] != "\"age\"") \
    .map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""),
                    str(s[3]).replace("\"", ""), int(s[5])))

bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerAsTable("bank")
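
A quick way to confirm that the registration itself succeeded (a minimal check, assuming Spark 1.3 or later, where SQLContext.tableNames() is available) is to list the temp tables visible to the context that did the registering, in the same %pyspark paragraph:

%pyspark
# Lists temp tables registered on this specific SQLContext instance;
# expect ['bank'] here even if %sql cannot see the table
print(sqlContext.tableNames())

If this prints ['bank'] while %sql still fails, the table was registered on a different SQLContext instance than the one the %sql interpreter uses.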

This code is in the same notebook but in a different paragraph.

%sql 
SELECT count(1) FROM bank

org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...

Solution

I found the cause of this issue. Prior to 0.6.0, the SQLContext is exposed to %pyspark under the variable name sqlc, not sqlContext. Because temp tables are scoped to the SQLContext instance that registered them, a table registered on a freshly created SQLContext(sc) is invisible to the shared context that %sql queries.

The defect is tracked here: https://issues.apache.org/jira/browse/ZEPPELIN-134

In Pyspark, the SQLContext is currently available in the variable name sqlc. This is inconsistent with the documentation and with the variable name in Scala, which is sqlContext.

sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)

Related code: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66

The suggested workaround is simply to do the following in your %pyspark script:

sqlContext = sqlc

Found here:

https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E
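
Putting the pieces together, a corrected version of the question's %pyspark paragraph might look like the following. This is a minimal sketch assuming Zeppelin 0.5.x, where the interpreter injects sc and sqlc; the pyspark.sql.types import is needed for the schema classes.

%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Alias the shared context so the temp table lands where %sql will look for it
sqlContext = sqlc

zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(), False),
                         StructField("job", StringType(), False),
                         StructField("marital", StringType(), False),
                         StructField("education", StringType(), False),
                         StructField("balance", IntegerType(), False)])

# Skip the quoted header row and keep the age/job/marital/education/balance columns
bank = bankText.map(lambda s: s.split(";")) \
    .filter(lambda s: s[0] != "\"age\"") \
    .map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""),
                    str(s[3]).replace("\"", ""), int(s[5])))

bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerAsTable("bank")  # registered on the shared context, visible to %sql

After running this paragraph, SELECT count(1) FROM bank in the %sql paragraph should return a count instead of the AnalysisException.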
