How to use a UDF to return multiple columns?


Question

Is it possible to create a UDF that returns a set of columns?

For example, given a data frame like this:

| Feature1 | Feature2 | Feature 3 |
| 1.3      | 3.4      | 4.5       |

Now I would like to extract a new feature that can be described as a vector of, say, two elements (e.g. the slope and offset of a linear regression). The desired dataset should look as follows:

| Feature1 | Feature2 | Feature 3 | Slope | Offset |
| 1.3      | 3.4      | 4.5       | 0.5   | 3      |

Is it possible to create multiple columns with a single UDF, or do I need to follow the rule "one column per UDF"?

Recommended answer

Struct method

You can define the udf function as

def myFunc: (String => (String, String)) = { s => (s.toLowerCase, s.toUpperCase)}

import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)

and use .* to expand the result:

val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select("Feature1", "Feature2", "Feature 3", "newCol.*")

For testing purposes I returned a Tuple2 from the udf function (higher-arity tuples can be used, depending on how many columns are required). Spark treats the returned tuple as a struct column. You can then use .* to select all of its elements as separate columns and finally rename them.

You should get the following output:

+--------+--------+---------+---+---+
|Feature1|Feature2|Feature 3|_1 |_2 |
+--------+--------+---------+---+---+
|1.3     |3.4     |4.5      |3.4|3.4|
+--------+--------+---------+---+---+

You can then rename _1 and _2.
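The renaming step is not spelled out above, so here is a minimal sketch of two ways it might be done. The first renames the generated _1 and _2 columns with withColumnRenamed. The second returns a case class instead of a tuple, so the struct fields already carry the desired names. The names LineFit, StructUdfRenaming, renameAfterExpand and namedStruct are hypothetical, and the local SparkSession and the one-row string-typed data frame are assumptions made only to keep the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical case class: its field names become the struct field names.
// It is declared at top level so Spark can derive a schema for it.
case class LineFit(Slope: String, Offset: String)

object StructUdfRenaming {
  // Variant 1: tuple-returning UDF, then rename the generated _1/_2 columns.
  def renameAfterExpand(df: DataFrame): DataFrame = {
    val tupleUDF = udf((s: String) => (s.toLowerCase, s.toUpperCase))
    df.withColumn("newCol", tupleUDF(col("Feature2")))
      .select("Feature1", "Feature2", "Feature 3", "newCol.*")
      .withColumnRenamed("_1", "Slope")
      .withColumnRenamed("_2", "Offset")
  }

  // Variant 2: case-class-returning UDF, so the expanded columns are already named.
  def namedStruct(df: DataFrame): DataFrame = {
    val namedUDF = udf((s: String) => LineFit(s.toLowerCase, s.toUpperCase))
    df.withColumn("newCol", namedUDF(col("Feature2")))
      .select("Feature1", "Feature2", "Feature 3", "newCol.*")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and string-typed feature columns.
    val spark = SparkSession.builder().master("local[1]").appName("struct-udf").getOrCreate()
    import spark.implicits._
    val df = Seq(("1.3", "3.4", "4.5")).toDF("Feature1", "Feature2", "Feature 3")

    renameAfterExpand(df).show(false)
    namedStruct(df).show(false)
    spark.stop()
  }
}
```

Both variants should produce the same five column names. The case class variant avoids the separate renaming step, at the cost of declaring one extra type.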

Array method

The udf function should return an array:

def myFunc: (String => Array[String]) = { s => Array("s".toLowerCase, s.toUpperCase)}

import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)

Then you can select the elements of the array and use an alias to rename them:

val newDF = df.withColumn("newCol", myUDF(df("Feature2"))).select($"Feature1", $"Feature2", $"Feature 3", $"newCol"(0).as("Slope"), $"newCol"(1).as("Offset"))

You should get

+--------+--------+---------+-----+------+
|Feature1|Feature2|Feature 3|Slope|Offset|
+--------+--------+---------+-----+------+
|1.3     |3.4     |4.5      |s    |3.4   |
+--------+--------+---------+-----+------+
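For reference, the array approach can be put together as one self-contained program, sketched below under a few assumptions. The object name ArrayUdfExample, the helper splitToColumns, the local SparkSession and the one-row string-typed data frame are all hypothetical. This sketch calls toLowerCase on the variable s rather than on the string literal "s", so both elements derive from the input value, and it uses col(...).getItem(i) in place of the $"newCol"(i) syntax, which would additionally need import spark.implicits._ in scope.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object ArrayUdfExample {
  def splitToColumns(df: DataFrame): DataFrame = {
    // All elements of the returned array must share one type (String here).
    val arrayUDF = udf((s: String) => Array(s.toLowerCase, s.toUpperCase))

    df.withColumn("newCol", arrayUDF(col("Feature2")))
      .select(
        col("Feature1"), col("Feature2"), col("Feature 3"),
        col("newCol").getItem(0).as("Slope"),   // first array element
        col("newCol").getItem(1).as("Offset"))  // second array element
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and string-typed feature columns.
    val spark = SparkSession.builder().master("local[1]").appName("array-udf").getOrCreate()
    import spark.implicits._
    val df = Seq(("1.3", "3.4", "4.5")).toDF("Feature1", "Feature2", "Feature 3")

    splitToColumns(df).show(false)
    spark.stop()
  }
}
```

Because an array column has a single element type, this approach fits best when all the new columns share a type. When the types differ, the struct approach is probably the more natural choice.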
