How to return a case class when using Spark higher-order functions?
Problem description
I am trying to use the Spark transform function in order to convert the items of an array from type ClassA to ClassB, as shown below:
case class ClassA(a: String, b: String, c: String)
case class ClassB(a: String, b: String)
val a1 = ClassA("a1", "b1", "c1")
val a2 = ClassA("a2", "b2", "c2")
val df = Seq(
(Seq(a1, a2))
).toDF("ClassA")
df.withColumn("ClassB", expr("transform(ClassA, c -> ClassB(c.a, c.b))")).show(false)
However, the above code fails with the following message:
org.apache.spark.sql.AnalysisException: Undefined function: 'ClassB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
The only way I could make this work was through struct, as shown next:
df.withColumn("ClassB", expr("transform(ClassA, c -> struct(c.a as string, c.b as string))")).show(false)
// +----------------------------+--------------------+
// |ClassA |ClassB |
// +----------------------------+--------------------+
// |[[a1, b1, c1], [a2, b2, c2]]|[[a1, b1], [a2, b2]]|
// +----------------------------+--------------------+
So the question is: is there any way to return a case class instead of a struct when using transform?
Recommended answer
The transform expression is relational and doesn't know anything about the case classes ClassA and ClassB.
AFAIK, the only way is to register a UDF so you can use your structure (or inject functions), but then you also have to deal with a Row-encoded value instead of ClassA (Spark SQL is all about encoding), like so:
import org.apache.spark.sql.Row

sparkSession.udf.register("toB", (a: Row) => ClassB(a.getAs[String]("a"), a.getAs[String]("b")))
df.withColumn("ClassB", expr("transform(ClassA, c -> toB(c))")).show(false)
Side note: Naming your column "ClassA" might be confusing since transform is reading the column, not the type.
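Outside Spark SQL, the per-element conversion the UDF performs is just an ordinary function between the two case classes; the only Spark-specific part is unwrapping the Row. A plain-Scala restatement (no Spark needed) of what toB computes for each array element:

```scala
case class ClassA(a: String, b: String, c: String)
case class ClassB(a: String, b: String)

// Per-element logic of the registered UDF: keep fields a and b, drop c.
def toB(x: ClassA): ClassB = ClassB(x.a, x.b)

val converted = Seq(ClassA("a1", "b1", "c1"), ClassA("a2", "b2", "c2")).map(toB)
// converted: Seq(ClassB("a1", "b1"), ClassB("a2", "b2"))
```

As a related remark, since Spark 3.0 the same higher-order function is also available on the Scala side as org.apache.spark.sql.functions.transform(Column, Column => Column), but its lambda still builds a Column (i.e. a struct), not a case class, so the Row/encoding caveat above still applies.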