How to group by on a field inside an array of an array of records?


Question

I have the following schema:

[name: StringType, grades: ArrayType(StructType(StructField(subject_grades, ArrayType(StructType(StructField(subject,StringType,false), StructField(grade,LongType,false))))))]

I want to group by the subject field inside the subject_grades array, which is itself inside the grades array.

I tried:

sql.sql("select ... from grades_table group by grades.subject_grades.subject") 

but I get:

org.apache.spark.sql.AnalysisException: cannot resolve 'grades.subject_grades[subject]' due to data type mismatch: argument 2 requires integral type, however, 'subject' is of string type.;

I understand why I get this error; however, I was hoping I could avoid exploding the entire thing just to group by the inner field.

Recommended answer

Arrays are (relatively) hard to work with, and they beg for explode (or flatMap) whenever the main query needs the elements inside, e.g. for grouping.

Something I learnt from the question is that, with subject_grades being of type ArrayType, the following clause is translated into an array lookup with subject as the index, hence the requirement of an integral type:

group by grades.subject_grades.subject

I see no other way than using explode (or flatMap) to "destructure" the subject_grades array and then do the grouping.
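
To make the "destructure, then group" idea concrete, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. It only illustrates the semantics of flattening both array levels before grouping on subject. The sample rows and the function names (explode_subject_grades, group_by_subject) are hypothetical and not part of the original question. In Spark the flattening step would be done with explode or flatMap.

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the schema in the question:
# name, grades[] -> subject_grades[] -> {subject, grade}
rows = [
    {"name": "alice", "grades": [
        {"subject_grades": [{"subject": "math", "grade": 90},
                            {"subject": "art", "grade": 70}]},
        {"subject_grades": [{"subject": "math", "grade": 80}]},
    ]},
    {"name": "bob", "grades": [
        {"subject_grades": [{"subject": "art", "grade": 60}]},
    ]},
]


def explode_subject_grades(rows):
    """Flatten both array levels: one (name, subject, grade) per inner record."""
    for row in rows:
        for grade_rec in row["grades"]:
            for sg in grade_rec["subject_grades"]:
                yield row["name"], sg["subject"], sg["grade"]


def group_by_subject(rows):
    """Group the flattened records by subject, collecting the grades."""
    grouped = defaultdict(list)
    for _name, subject, grade in explode_subject_grades(rows):
        grouped[subject].append(grade)
    return dict(grouped)


by_subject = group_by_subject(rows)
# Example aggregate per group: the average grade for each subject.
averages = {subject: sum(g) / len(g) for subject, g in by_subject.items()}
```

Once the nested arrays are flattened into one record per (name, subject, grade), subject is an ordinary field again and grouping on it is straightforward, which is exactly what the failing group by clause could not express directly.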
