Spark Scala data frame UDF returning rows

Question

Say I have a DataFrame which contains a column (called colA) which is a Seq of Row. I want to append a new field to each record of colA. (And the new field depends on the existing record, so I have to write a UDF.) How should I write this UDF?

I have tried to write a UDF which takes colA as input and outputs a Seq[Row] where each record contains the new field. But the problem is that the UDF cannot return Seq[Row]; the exception is 'Schema for type org.apache.spark.sql.Row is not supported'. What should I do?

The UDF that I wrote: val convert = udf[Seq[Row], Seq[Row]](blablabla...), and the exception is: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported

Answer

Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an array of Doubles:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{ArrayType, DoubleType}

// schema describing the data the UDF returns
val schema = ArrayType(DoubleType)

val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
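
Applied to the question's use case, a minimal sketch might look like the following. It assumes colA is an array of structs with a single Double field x, and appends a Boolean field derived from each record; the field names, the isPositive flag, and the DataFrame df are illustrative assumptions, not part of the original question.

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

// Output schema: the original field plus the appended one.
// Assumed input layout: colA is array<struct<x:double>>.
val outSchema = ArrayType(StructType(Seq(
  StructField("x", DoubleType),
  StructField("isPositive", BooleanType)
)))

val convert = udf((records: Seq[Row]) => {
  records.map { r =>
    val x = r.getDouble(0)
    Row(x, x > 0) // keep the original value, append the derived flag
  }
}, outSchema)

val result = df.withColumn("colA", convert(col("colA")))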

That said, I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDF.
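
For comparison, here is a sketch of the case-class variant under the same assumed layout; the Enriched class is hypothetical. Spark derives the result schema from the case class via reflection, so no explicit schema argument is needed:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Spark infers the return schema from the case class,
// so the schema argument can be dropped.
case class Enriched(x: Double, isPositive: Boolean)

val convertTyped = udf((records: Seq[Row]) =>
  records.map { r =>
    val x = r.getDouble(0)
    Enriched(x, x > 0)
  }
)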

EDIT: It could be useful if your row contains more than 22 fields (tuples are limited to 22 fields; before Scala 2.11, case classes were as well).
