How to add multiple columns using UDF?

Question

I want to add the return values of a UDF to an existing dataframe in separate columns. How do I achieve this in a resourceful way?

Here is an example of what I have so far.

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType  

df = spark.createDataFrame([("Alive",4)],["Name","Number"])
df.show(1)

+-----+------+
| Name|Number|
+-----+------+
|Alive|     4|
+-----+------+

def example(n):
        return [[n+2], [n-2]]

#  schema = StructType([
#          StructField("Out1", ArrayType(IntegerType()), False),
#          StructField("Out2", ArrayType(IntegerType()), False)])

example_udf = udf(example)

Now I can add a column to the dataframe as follows:

newDF = df.withColumn("Output", example_udf(df["Number"]))
newDF.show(1)
+-----+------+----------+
| Name|Number|Output    |
+-----+------+----------+
|Alive|     4|[[6], [2]]|
+-----+------+----------+

However, I don't want the two values to be in the same column, but rather in separate ones.

Ideally I'd like to split the output column now, to avoid calling the example function twice (once for each return value), as explained here and here. However, in my situation I'm getting an array of arrays, and I can't see how a split would work there (please note that each array will contain multiple values, separated by a ",").

What the result should look like

What I ultimately want is:

+-----+------+----+----+
| Name|Number|Out1|Out2|
+-----+------+----+----+
|Alive|     4|   6|   2|
+-----+------+----+----+

Note that the use of the StructType return type is optional and doesn't necessarily have to be part of the solution.

I commented out the use of StructType (and edited the udf assignment) since it's not necessary for the return type of the example function. However, it would have to be used if the return value were something like:

return [6,3,2],[4,3,1]

Recommended answer

To return a StructType, just use Row:

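import pyspark.sql.functions as f
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])


def example(n):
    # A Row with named fields maps onto the struct fields declared in the schema
    return Row('Out1', 'Out2')(n + 2, n - 2)


schema = StructType([
    StructField("Out1", IntegerType(), False),
    StructField("Out2", IntegerType(), False)])

# Register the function as a UDF whose return type is the struct above
example_udf = f.udf(example, schema)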

newDF = df.withColumn("Output", example_udf(df["Number"]))
newDF = newDF.select("Name", "Number", "Output.*")

newDF.show(truncate=False)
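
For the case mentioned in the question, where each output is itself a list (something like return [6,3,2],[4,3,1]), the same approach should carry over by declaring the struct fields as arrays. The following is a minimal sketch and not part of the original answer: the values returned by example, the run_demo helper, and the local[1] master are assumptions made for illustration only.

```python
def example(n):
    """Return two lists; each one fills an array-typed struct field."""
    # Illustrative values only; the question does not say how the lists are built.
    return [n + 2, n + 1, n], [n - 2, n - 1, n]


def run_demo():
    """Hypothetical helper: applies example() as a struct-returning UDF."""
    # pyspark is imported here so example() stays a plain Python function.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f
    from pyspark.sql.types import (ArrayType, IntegerType,
                                   StructField, StructType)

    # Assumption: a local single-threaded session is good enough for a demo.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("Alive", 4)], ["Name", "Number"])

    # One struct field per desired output column, each holding an int array.
    schema = StructType([
        StructField("Out1", ArrayType(IntegerType()), False),
        StructField("Out2", ArrayType(IntegerType()), False),
    ])
    example_udf = f.udf(example, schema)

    # "Output.*" expands the struct fields into top-level columns.
    new_df = (df.withColumn("Output", example_udf(df["Number"]))
                .select("Name", "Number", "Output.*"))
    new_df.show(truncate=False)
    return new_df
```

Calling run_demo() in an environment where pyspark is installed should give a dataframe with the columns Name, Number, Out1 and Out2, where Out1 and Out2 are array columns.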
