从 UDF 内的 Spark SQL 行中提取嵌套数组 [英] Extract a nested array from a Spark SQL Row inside a UDF
问题描述
我正在使用 DataFrames,需要提取数据.我有很多嵌套级别,所以我用爆炸和选择做了第一个级别,然后我使用 UDF 嵌套级别.
I'm working with DataFrames and need to extract data. I have many nested levels, so I did the first level with explodes and selects but then I use UDFs for nested levels.
我有一个使用 $"Root.Obj"
的 UDF,它是一个数组,我希望它返回一个 Array[MyObj].
我的输出类:
I have an UDF taking $"Root.Obj"
, which is an array, and I want it to return an Array[MyObj].
My output classes :
case class MyObj(fieldA: Boolean, fieldB: String, fieldC: Array[MyNested])
case class MyNested(field1: Long, field2: String)
简而言之,这是输入模式:
This is the input schema in short :
|-- Obj: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FieldA: boolean (nullable = true)
| | |-- FieldB: string (nullable = true)
| | |-- FieldC: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Field1: long (nullable = true)
| | | | |-- Field2: string (nullable = true)
| | |-- FieldD: boolean (nullable = true)
我的 UDF:
def extractObjs: UserDefinedFunction = udf {
(objs: Seq[Row]) ⇒
objs.map {
obj ⇒
MyObj(
obj.getAs[Boolean]("FieldA"),
obj.getAs[String]("FieldB"),
extractNested(obj.get???("FieldC"))
)
}
}
def extractNested(nesteds: ???): Array[MyNested] = {
???
}
这是更复杂的 IRL,因为我需要从其他地方检索值并且有更多的嵌套数组.此外,Obj 和 FieldC 的输入结构比这里复杂得多,我不能(或不想)为它们创建案例类.由于我需要在多个地方执行此操作,因此假设我不知道 FieldC 元素的结构".
This is more complicated IRL as I need to retrieve values from other places and there is more nested arrays. Moreover, the input structures of Obj and FieldC are far more complex than here and I can't (or don't want) create case class for them. As I'll need to do this in over places, let's say I don't know "the struct" of FieldC elements.
我的问题是提取FieldC"数组.我想要一个 Seq[Row] 但我无法得到它,getStruct 只给我一个 Row,并且 getSeq[Row] 在之后抛出错误,因为 scala.collection.mutable.WrappedArray$ofRef 不能被强制转换到 org.apache.spark.sql.Row
.
My problem is to extract the "FieldC" array. I would like a Seq[Row] but I couldn't achieve to get that, getStruct give me just a Row, and getSeq[Row] is throwing error after because scala.collection.mutable.WrappedArray$ofRef cannot be cast to org.apache.spark.sql.Row
.
推荐答案
结构映射到 UDF 中的 Row
,因此结构数组可以通过 Seq[Row]
,例如:
Structs map to Row
in an UDF, so array of structs can be accessed by Seq[Row]
, such as :
def extractObjs: UserDefinedFunction = udf {
(objs: Seq[Row]) ⇒
objs.map {
obj ⇒
MyObj(
obj.getAs[Boolean]("FieldA"),
obj.getAs[String]("FieldB"),
extractNested(obj.getAs[Seq[Row]]("FieldC"))
)
}
}
def extractNested(nesteds: Seq[Row]): Array[MyNested] = {
nesteds.map(r => MyNested(r.getAs[Long]("Field1"),r.getAs[String]("Field2"))).toArray
}
这篇关于从 UDF 内的 Spark SQL 行中提取嵌套数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!