How to group by a field inside an array of an array of records?

Question

I have the following schema:

[name: StringType, grades: ArrayType(StructType(StructField(subject_grades, ArrayType(StructType(StructField(subject, StringType, false), StructField(grade, LongType, false))))))]

I want to group by the subject field inside the subject_grades array, which is itself inside the grades array.

I tried:

sql.sql("select ... from grades_table group by grades.subject_grades.subject") 

but I get:

org.apache.spark.sql.AnalysisException: cannot resolve 'grades.subject_grades[subject]' due to data type mismatch: argument 2 requires integral type, however, 'subject' is of string type.;

I understand why I get this error, but I was hoping to avoid exploding the entire structure just to group by the inner field.

Recommended answer

Arrays are (relatively) hard to work with, and they beg for explode (or flatMap) whenever the main query needs the elements inside them, e.g. for grouping.

Something I learnt from the question is that, because subject_grades is of type ArrayType, the following clause is translated into an array lookup in which subject is treated as the index (grades.subject_grades[subject] in the error message), hence the requirement of an integral type.

group by grades.subject_grades.subject

I see no other way than using explode (or flatMap) to "destructure" the subject_grades array and then do the grouping.
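As a minimal sketch of that approach, the Scala snippet below explodes grades, then explodes subject_grades, and only then groups by subject. The case classes (SubjectGrade, GradeRecord, Student), the sample rows, the local SparkSession, and the choice of avg as the aggregate are all assumptions made for illustration; the question elides its select list and does not show how the table is built.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, explode}

// Hypothetical case classes mirroring the schema in the question.
case class SubjectGrade(subject: String, grade: Long)
case class GradeRecord(subject_grades: Seq[SubjectGrade])
case class Student(name: String, grades: Seq[GradeRecord])

object GroupBySubject {
  def main(args: Array[String]): Unit = {
    // Local session for the sketch; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-by-subject")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, only to give the query something to run on.
    val students = Seq(
      Student("alice", Seq(GradeRecord(Seq(SubjectGrade("math", 90L), SubjectGrade("art", 80L))))),
      Student("bob", Seq(GradeRecord(Seq(SubjectGrade("math", 70L)))))
    ).toDF()

    val perSubject = students
      // First explode: one row per element of the outer grades array.
      .select(explode(col("grades")).as("g"))
      // Second explode: one row per (subject, grade) struct.
      .select(explode(col("g.subject_grades")).as("sg"))
      // subject is now a plain struct field, so grouping on it resolves.
      .groupBy(col("sg.subject"))
      // avg is an assumed aggregate; substitute whatever the query needs.
      .agg(avg(col("sg.grade")).as("avg_grade"))

    perSubject.show()
    spark.stop()
  }
}
```

Only the two array columns are exploded here. Other columns such as name are dropped by the first select and would have to be carried along explicitly if the result needs them.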
