Apache Spark Window function with nested column

Question

I'm not sure this is a bug (or just incorrect syntax). I searched around and didn't see this mentioned elsewhere so I'm asking here before filing a bug report.

I'm trying to use a Window function partitioned on a nested column. I've created a small example below demonstrating the problem.

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Fold A, B, and C into a nested struct column "Data", leaving only num and Data.
val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
  .withColumn("Data", struct("A", "B", "C")).drop("A").drop("B").drop("C")

// Partition the window on the nested fields Data.A and Data.B.
val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
data.select($"*", max("num").over(winSpec) as "max").where("num = max").drop("max").show

The above results in the error:

org.apache.spark.sql.AnalysisException: resolved attribute(s) A#39,B#40 missing from num#33,Data#37 in operator !Project [num#33,Data#37,A#39,B#40];
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
  ...

If instead those columns aren't nested, it works fine. Am I missing something with the syntax, or is this a bug?
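For comparison, here is a rough sketch of the non-nested variant described above (the names flat and flatSpec are illustrative, not from the original post), which runs without the error:

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Same rows, but A, B, and C stay as top-level columns instead of a struct.
// ("flat" and "flatSpec" are hypothetical names for illustration.)
val flat = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")

// Partitioning on top-level columns resolves fine.
val flatSpec = Window.partitionBy("A", "B").orderBy($"num".desc)
flat.select($"*", max("num").over(flatSpec) as "max").where("num = max").drop("max").show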

Answer

It looks to me like you are hitting a bug when the analyzer is trying to expand the "*":

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

sql("SET spark.sql.eagerAnalysis=false") // Let us see the error even though we are constructing an invalid tree

val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
  .withColumn("Data", struct("A", "B", "C"))
  .drop("A")
  .drop("B")
  .drop("C")

val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
data.select($"*", max("num").over(winSpec) as "max").explain(true)

By turning off eager analysis (so that we can call explain without it throwing an error) you can see that the "*" is getting expanded to include columns that aren't actually available:

== Parsed Logical Plan ==
'Project [*,'max('num) windowspecdefinition('Data.A,'Data.B,'num DESC,UnspecifiedFrame) AS max#64928]
+- Project [num#64926,Data#64927]
   +- Project [C#64925,num#64926,Data#64927]
      +- Project [B#64924,C#64925,num#64926,Data#64927]
         +- Project [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS Data#64927]
            +- Project [_1#64919 AS A#64923,_2#64920 AS B#64924,_3#64921 AS C#64925,_4#64922 AS num#64926]
               +- LocalRelation [_1#64919,_2#64920,_3#64921,_4#64922], [[a,b,c,3],[c,b,a,3]]

== Analyzed Logical Plan ==
num: int, Data: struct<A:string,B:string,C:string>, max: int
Project [num#64926,Data#64927,max#64928]
+- Project [num#64926,Data#64927,A#64932,B#64933,max#64928,max#64928]
   +- Window [num#64926,Data#64927,A#64932,B#64933], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMax(num#64926) windowspecdefinition(A#64932,B#64933,num#64926 DESC,RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max#64928], [A#64932,B#64933], [num#64926 DESC]
      +- !Project [num#64926,Data#64927,A#64932,B#64933]
         +- Project [num#64926,Data#64927]
            +- Project [C#64925,num#64926,Data#64927]
               +- Project [B#64924,C#64925,num#64926,Data#64927]
                  +- Project [A#64923,B#64924,C#64925,num#64926,struct(A#64923,B#64924,C#64925) AS Data#64927]
                     +- Project [_1#64919 AS A#64923,_2#64920 AS B#64924,_3#64921 AS C#64925,_4#64922 AS num#64926]
                        +- LocalRelation [_1#64919,_2#64920,_3#64921,_4#64922], [[a,b,c,3],[c,b,a,3]]

I've filed this here: https://issues.apache.org/jira/browse/SPARK-12989. If you manually list out the columns instead of using a *, that should act as a workaround.
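For instance, a minimal sketch of that workaround applied to the example above (not from the original answer), naming num and Data explicitly in place of $"*":

// Workaround sketch: list the columns explicitly so the analyzer
// never has to expand "*" around the Window operator.
val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
data.select($"num", $"Data", max("num").over(winSpec) as "max")
  .where("num = max").drop("max").show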
