py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement


Problem description

I'm trying to perform a simple task with a Spark DataFrame (Python): create a new DataFrame by selecting specific columns, including nested columns, from another DataFrame. For example:

df.printSchema()
root
 |-- time_stamp: long (nullable = true)
 |-- country: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- order: struct (nullable = true)
 |    |-- created_at: string (nullable = true)
 |    |-- creation_type: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |-- destination: struct (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- ordering_user: struct (nullable = true)
 |    |    |-- cancellation_score: long (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- is_test: boolean (nullable = true)

df2=df.sqlContext.sql("""select a.country_code as country_code,
a.order_destination_state as order_destination_state,
a.order_ordering_user_id as order_ordering_user_id,
a.order_ordering_user_is_test as order_ordering_user_is_test,
a.time_stamp as time_stamp
from
(select
flat_order_creation.order.destination.state as order_destination_state,
flat_order_creation.order.ordering_user.id as order_ordering_user_id,
flat_order_creation.order.ordering_user.is_test as order_ordering_user_is_test,
flat_order_creation.time_stamp as time_stamp
from flat_order_creation) a""")

and I get the following error:

Traceback (most recent call last):
  File "/home/hadoop/scripts/orders_all.py", line 180, in <module>
    df2=sqlContext.sql(q)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.sql.
: java.lang.RuntimeException: [6.21] failure: ``*'' expected but `order' found

flat_order_creation.order.destination.state as order_destination_state,

I'm using spark-submit with master in local mode to run this code. It's important to mention that when I connect to the pyspark shell and run the code line by line it works, but when I submit it (even in local mode) it fails. It's also worth mentioning that selecting a non-nested field works in both cases. I'm using Spark 1.5.2 on EMR (release 4.2.0).

Answer

Without a Minimal, Complete, and Verifiable example I can only guess, but it looks like you're using different SparkContext implementations in the interactive shell and in your standalone program.

As long as the Spark binaries have been built with Hive support, the sqlContext provided in the shell is a HiveContext. Among other differences, it provides a more sophisticated SQL parser than a plain SQLContext. You can easily reproduce your problem as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf: SparkConf = ???
val sc: SparkContext = ???
val query = "SELECT df.foobar.order FROM df"

val hiveContext: SQLContext = new HiveContext(sc)
val sqlContext: SQLContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"foobar": {"order": 1}}"""))

sqlContext.read.json(json).registerTempTable("df")
sqlContext.sql(query).show
// java.lang.RuntimeException: [1.18] failure: ``*'' expected but `order' found
// ...

hiveContext.read.json(json).registerTempTable("df")
hiveContext.sql(query)
// org.apache.spark.sql.DataFrame = [order: bigint]

Initializing sqlContext with a HiveContext in the standalone program should do the trick:

from pyspark.sql import HiveContext

# sc is the already-created SparkContext
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame(...)
df.registerTempTable("flat_order_creation")

sqlContext.sql(...)

It is important to note that the problem is not the nesting itself but the use of the ORDER keyword as a column name. So if using HiveContext is not an option, just rename the field to something else.
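If renaming is not possible either, another escape hatch worth trying is backtick-quoting the reserved segment of the column path. This is a hedged suggestion: HiveContext accepts backtick-quoted identifiers, but whether the plain SQLContext parser in Spark 1.5 does would need to be tested. The `quote_path` helper below is hypothetical (not part of any Spark API); it just rewrites a dotted path so that keyword segments such as `order` come out quoted:

```python
# Hypothetical helper (not part of Spark): backtick-quote any dotted path
# segment that collides with a SQL keyword, e.g. the `order` struct field.
RESERVED = {"order", "select", "from", "where", "group", "by"}

def quote_path(path):
    """Return `path` with reserved-word segments wrapped in backticks."""
    return ".".join(
        "`%s`" % seg if seg.lower() in RESERVED else seg
        for seg in path.split(".")
    )

print(quote_path("flat_order_creation.order.destination.state"))
# flat_order_creation.`order`.destination.state
```

With HiveContext you could then write `select flat_order_creation.`order`.destination.state ...` directly; against a plain SQLContext, try the quoted form on your target Spark version before relying on it.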

