How to group by on a field inside an array of an array of records?
Question
I have the following schema:
[name: StringType, grades: ArrayType( StructType( StructField(subject_grades, ArrayType(StructType(StructField(subject,StringType,false), StructField(grade,LongType,false)]
I want to group by the subject field inside the subject_grades array, which is itself inside the grades array.
I tried:
sql.sql("select ... from grades_table group by grades.subject_grades.subject")
but I get:
org.apache.spark.sql.AnalysisException: cannot resolve 'grades.subject_grades[subject]' due to data type mismatch: argument 2 requires integral type, however, 'subject' is of string type.;
I understand why I get this error, but I was hoping to avoid exploding the entire thing just to group by the inner field.
Recommended answer
Arrays are (relatively) hard to work with and beg for explode (or flatMap) when the main query needs the elements inside, e.g. for grouping.
Something I learnt from the question is that, with subject_grades being of type ArrayType, the following clause is translated to one in which subject is treated as the array index, hence the requirement for an integral type:
group by grades.subject_grades.subject
I see no other way than using explode (or flatMap) to "destructure" the subject_grades array and then do the grouping.
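As a minimal sketch of that explode-then-group approach, the Scala snippet below flattens both array levels and then groups on the inner subject field. The case classes, the sample rows, the helper name avgGradeBySubject and the choice of avg as the aggregate are all assumptions made for illustration; the question elides its select list, so the real aggregate may differ.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, explode}

// Hypothetical case classes that roughly mirror the schema in the question.
case class SubjectGrade(subject: String, grade: Long)
case class GradeRecord(subject_grades: Seq[SubjectGrade])
case class Student(name: String, grades: Seq[GradeRecord])

object GroupByNestedSubject {
  // Explode the outer array, then the inner one, so that each
  // (subject, grade) struct becomes its own row; only then group.
  // avg is an assumed aggregate, chosen for illustration.
  def avgGradeBySubject(students: DataFrame): DataFrame =
    students
      .select(explode(col("grades")).as("grade_rec"))
      .select(explode(col("grade_rec.subject_grades")).as("sg"))
      .groupBy(col("sg.subject").as("subject"))
      .agg(avg(col("sg.grade")).as("avg_grade"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-group-by")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, not taken from the question.
    val students = Seq(
      Student("alice", Seq(GradeRecord(Seq(SubjectGrade("math", 90L), SubjectGrade("art", 70L))))),
      Student("bob", Seq(GradeRecord(Seq(SubjectGrade("math", 80L)))))
    ).toDF()

    avgGradeBySubject(students).show()
    spark.stop()
  }
}
```

In plain Spark SQL the same two-step flattening can be written with two chained LATERAL VIEW explode(...) clauses before the GROUP BY.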