How to return a case class when using Spark higher-order functions?


Question

I am trying to use Spark's transform function to convert the items of an array from type ClassA into ClassB, as shown below:

import org.apache.spark.sql.functions.expr
import spark.implicits._ // assuming a SparkSession named `spark` is in scope, as in spark-shell

case class ClassA(a: String, b: String, c: String)
case class ClassB(a: String, b: String)

val a1 = ClassA("a1", "b1", "c1")
val a2 = ClassA("a2", "b2", "c2")

// One row holding an array<struct<a,b,c>> column named "ClassA".
val df = Seq(
  (Seq(a1, a2))
).toDF("ClassA")

// Fails at analysis time: ClassB is not a SQL function (see the error below).
df.withColumn("ClassB", expr("transform(ClassA, c -> ClassB(c.a, c.b))")).show(false)

However, the above code fails with the message:

org.apache.spark.sql.AnalysisException: Undefined function: 'ClassB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.

The only way I could make this work was through struct, as shown next:

df.withColumn("ClassB", expr("transform(ClassA, c -> struct(c.a as string, c.b as string))")).show(false)

// +----------------------------+--------------------+
// |ClassA                      |ClassB              |
// +----------------------------+--------------------+
// |[[a1, b1, c1], [a2, b2, c2]]|[[a1, b1], [a2, b2]]|
// +----------------------------+--------------------+

So the question is: is there any way to return a case class instead of a struct when using transform?

Answer

The transform expression is relational and doesn't know anything about the case classes ClassA and ClassB. The only way AFAIK would be to register a UDF so that you can use your structure (or inject functions), but you would also have to deal with a Row-encoded value instead of ClassA (SparkSQL is all about encoding :) ), like so:

import org.apache.spark.sql.Row

// Each array element reaches the UDF as a generic Row; read the fields by
// name and return a ClassB, for which Spark derives an encoder automatically.
sparkSession.udf.register("toB", (a: Row) => ClassB(a.getAs[String]("a"), a.getAs[String]("b")))

df.withColumn("ClassB", expr("transform(ClassA, c -> toB(c))")).show(false)

Side note: naming your column "ClassA" might be confusing, since transform reads the column, not the type.
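Building on the encoding point: if the struct fields are aliased to match ClassB's field names (a and b, rather than the string aliases used in the question), the column can also be decoded back into real case-class instances on the Scala side via the Dataset API, instead of inside the SQL expression. A minimal sketch, assuming the df defined in the question and a SparkSession named spark providing the implicits:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.expr
import spark.implicits._ // assuming a SparkSession named `spark`

// Alias the struct fields so they line up with ClassB's fields, then decode
// the resulting array<struct<a,b>> column with ClassB's derived encoder.
val dsB: Dataset[Seq[ClassB]] = df
  .select(expr("transform(ClassA, c -> struct(c.a as a, c.b as b))").as("ClassB"))
  .as[Seq[ClassB]]

dsB.collect().foreach(println) // each row is now a Scala Seq of ClassB instances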
