py4j.protocol.Py4JJavaError when selecting nested column in dataframe using select statement


Question

I'm trying to perform a simple task on a Spark DataFrame (Python): creating a new DataFrame by selecting specific columns and nested columns from another DataFrame. For example:

df.printSchema()
root
 |-- time_stamp: long (nullable = true)
 |-- country: struct (nullable = true)
 |    |-- code: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- time_zone: string (nullable = true)
 |-- event_name: string (nullable = true)
 |-- order: struct (nullable = true)
 |    |-- created_at: string (nullable = true)
 |    |-- creation_type: struct (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |-- destination: struct (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- ordering_user: struct (nullable = true)
 |    |    |-- cancellation_score: long (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- is_test: boolean (nullable = true)

df2=df.sqlContext.sql("""select a.country_code as country_code,
a.order_destination_state as order_destination_state,
a.order_ordering_user_id as order_ordering_user_id,
a.order_ordering_user_is_test as order_ordering_user_is_test,
a.time_stamp as time_stamp
from
(select
flat_order_creation.order.destination.state as order_destination_state,
flat_order_creation.order.ordering_user.id as order_ordering_user_id,
flat_order_creation.order.ordering_user.is_test as order_ordering_user_is_test,
flat_order_creation.time_stamp as time_stamp
from flat_order_creation) a""")

I get the following error:

Traceback (most recent call last):
  File "/home/hadoop/scripts/orders_all.py", line 180, in <module>
    df2=sqlContext.sql(q)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
  File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.sql.
: java.lang.RuntimeException: [6.21] failure: ``*'' expected but `order' found

flat_order_creation.order.destination.state as order_destination_state,

I'm running this code with spark-submit with master in local mode. It's important to mention that when I connect to the pyspark shell and run the code line by line, it works, but when I submit it (even in local mode) it fails. It's also worth mentioning that selecting a non-nested field works fine. I'm using Spark 1.5.2 on EMR (release 4.2.0).

Answer

Without a Minimal, Complete, and Verifiable example I can only guess, but it looks like you're using different SparkContext implementations in the interactive shell and in your standalone program.

As long as the Spark binaries have been built with Hive support, the sqlContext provided in the shell is a HiveContext. Among other differences, it provides a more sophisticated SQL parser than the plain SQLContext. You can easily reproduce your problem as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf: SparkConf = ???
val sc: SparkContext = ???
val query = "SELECT df.foobar.order FROM df"

val hiveContext: SQLContext = new HiveContext(sc)
val sqlContext: SQLContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"foobar": {"order": 1}}"""))

sqlContext.read.json(json).registerTempTable("df")
sqlContext.sql(query).show
// java.lang.RuntimeException: [1.18] failure: ``*'' expected but `order' found
// ...

hiveContext.read.json(json).registerTempTable("df")
hiveContext.sql(query)
// org.apache.spark.sql.DataFrame = [order: bigint]

Initializing sqlContext with a HiveContext in the standalone program should do the trick:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc) 

df = sqlContext.createDataFrame(...)
df.registerTempTable("flat_order_creation")

sqlContext.sql(...)

It is important to note that the problem is not the nesting itself but the use of the ORDER keyword as a column name. So if using a HiveContext is not an option, just rename the field to something else.
