How to add multiple columns using UDF?


Question

Problem

I want to add the return values of a UDF to an existing dataframe in separate columns. How do I achieve this in a resourceful way?

Here is an example of what I have so far.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType  

df = spark.createDataFrame([("Alive",4)],["Name","Number"])
df.show(1)

+-----+------+
| Name|Number|
+-----+------+
|Alive|     4|
+-----+------+

def example(n):
    return [[n+2], [n-2]]

#  schema = StructType([
#          StructField("Out1", ArrayType(IntegerType()), False),
#          StructField("Out2", ArrayType(IntegerType()), False)])

example_udf = udf(example)

Now I can add a column to the dataframe as follows:

newDF = df.withColumn("Output", example_udf(df["Number"]))
newDF.show(1)
+-----+------+----------+
| Name|Number|Output    |
+-----+------+----------+
|Alive|     4|[[6], [2]]|
+-----+------+----------+

However, I don't want the two values to be in the same column but rather in separate ones.

Ideally I'd like to split the output column now, to avoid calling the example function twice (once for each return value), as explained here and here. However, in my situation I'm getting an array of arrays, and I can't see how a split would work there (please note that each array will contain multiple values, separated with a ",").

What the result should look like

What I eventually want is this:

+-----+------+----+----+
| Name|Number|Out1|Out2|
+-----+------+----+----+
|Alive|     4|   6|   2|
+-----+------+----+----+

Note that the use of the StructType return type is optional and doesn't necessarily have to be part of the solution.

I commented out the use of StructType (and edited the udf assignment) since it's not necessary for the return type of the example function. However, it has to be used if the return value were something like

return [6,3,2],[4,3,1]
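
To illustrate that case, here is a minimal sketch of how the commented-out schema might be put to use. It assumes a dataframe `df` like the one above; the list values returned by `example` and the helper name `with_struct_output` are made up for illustration. The result is still a single `Output` column (a struct with the fields `Out1` and `Out2`), so it does not yet split the values into separate columns.

```python
def example(n):
    # Hypothetical values: two integer lists, the same shape as
    # `return [6,3,2],[4,3,1]` in the question.
    return [n + 2, n + 1, n], [n - 2, n - 1, n]


def with_struct_output(df):
    # pyspark is imported lazily so that `example` can be exercised
    # as plain Python without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

    # The schema that is commented out in the question, now in use.
    schema = StructType([
        StructField("Out1", ArrayType(IntegerType()), False),
        StructField("Out2", ArrayType(IntegerType()), False),
    ])
    example_udf = udf(example, schema)

    # Adds one struct column named "Output" with the fields Out1 and Out2.
    return df.withColumn("Output", example_udf(df["Number"]))
```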

Recommended answer

To return a StructType, just use Row:

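from pyspark.sql import Row
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])


def example(n):
    # One Row per input value; the field names match the schema below
    return Row('Out1', 'Out2')(n + 2, n - 2)


schema = StructType([
    StructField("Out1", IntegerType(), False),
    StructField("Out2", IntegerType(), False)])

example_udf = f.udf(example, schema)

newDF = df.withColumn("Output", example_udf(df["Number"]))
# "Output.*" expands the struct fields into top-level columns
newDF = newDF.select("Name", "Number", "Output.*")

newDF.show(truncate=False)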
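
If the UDF has to keep returning the nested list from the question, one possible alternative is to declare the return type as an array of arrays and index into it. The following is only a sketch under that assumption: `split_output` is a hypothetical helper name, and `df` is expected to have the `Name` and `Number` columns used above.

```python
def example(n):
    # Same nested-list shape as in the question.
    return [[n + 2], [n - 2]]


def split_output(df):
    # pyspark is imported lazily so that `example` can be exercised
    # as plain Python without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Declare the return type explicitly; without one, udf() produces
    # a string column that cannot be indexed like an array.
    example_udf = F.udf(example, ArrayType(ArrayType(IntegerType())))
    with_output = df.withColumn("Output", example_udf(df["Number"]))

    # Take the first element of each inner array as its own column.
    return with_output.select(
        "Name",
        "Number",
        F.col("Output")[0][0].alias("Out1"),
        F.col("Output")[1][0].alias("Out2"),
    )
```

Calling `split_output(df).show()` is meant to give the `Out1`/`Out2` layout asked for in the question. When the inner arrays hold several values, the second index would have to be chosen per value, or the inner arrays kept as array columns by dropping the second index.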
