Apache Spark -- Assign the result of UDF to multiple dataframe columns

Views: 124 Published: 2020/9/4 0:04:45 Tags: python apache-spark pyspark apache-spark-sql user-defined-functions

Question

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (which contains a json string). That will return X values, each of which needs to be stored in its own separate column.

That functionality will be implemented in a UDF. However, I am not sure how to return a list of values from that UDF and feed these into individual columns. Below is a simple example:

(...)
from pyspark.sql.functions import udf
def udf_test(n):
    return [n/2, n%2]

test_udf=udf(udf_test)


df.select('amount','trans_date').withColumn("test", test_udf("amount")).show(5)

This produces the following result:

+------+----------+--------------------+
|amount|trans_date|                test|
+------+----------+--------------------+
|  28.0|2016-02-07|         [14.0, 0.0]|
| 31.01|2016-02-07|[15.5050001144409...|
| 13.41|2016-02-04|[6.70499992370605...|
| 307.7|2015-02-17|[153.850006103515...|
| 22.09|2016-02-05|[11.0450000762939...|
+------+----------+--------------------+
only showing top 5 rows

What would be the best way to store the two (in this example) values returned by the udf in separate columns? Right now they are typed as strings:

df.select('amount','trans_date').withColumn("test", test_udf("amount")).printSchema()

root
 |-- amount: float (nullable = true)
 |-- trans_date: string (nullable = true)
 |-- test: string (nullable = true)

Recommended answer

It is not possible to create multiple top-level columns from a single UDF call, but you can create a new struct. This requires a UDF with a specified returnType:

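from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType, StructField, StructType

# Return type of the UDF: a struct with two non-nullable float fields
schema = StructType([
    StructField("foo", FloatType(), False),
    StructField("bar", FloatType(), False)
])

def udf_test(n):
    # Return NaNs for missing or zero input; otherwise the half and the remainder
    return (n / 2, n % 2) if n else (float('nan'), float('nan'))

test_udf = udf(udf_test, schema)
# Assumes an existing SparkContext named sc (e.g. from the pyspark shell)
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])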

foobars = df.select(test_udf("y").alias("foobar"))
foobars.printSchema()
## root
##  |-- foobar: struct (nullable = true)
##  |    |-- foo: float (nullable = false)
##  |    |-- bar: float (nullable = false)

You can further flatten the schema with a simple select:

foobars.select("foobar.foo", "foobar.bar").show()
## +---+---+
## |foo|bar|
## +---+---+
## |1.0|0.0|
## |1.5|1.0|
## +---+---+
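
If the original columns should be kept next to the new ones, the struct can be attached with withColumn and then expanded with a star. The following is only a sketch under assumptions: the names halve_and_mod and add_foo_bar are hypothetical helpers introduced here, not part of the original answer, and the input dataframe is assumed to have a numeric column (y by default).

```python
import math


def halve_and_mod(n):
    """Plain-Python logic for the UDF: (n / 2, n % 2), or NaNs for missing/zero input."""
    if not n:  # covers None and 0.0, mirroring the guard in the answer
        return (math.nan, math.nan)
    return (n / 2, n % 2)


def add_foo_bar(df, source_col="y"):
    """Hypothetical helper: return df with foo and bar added as top-level columns."""
    # Imported lazily so halve_and_mod can be tried out without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType, StructField, StructType

    schema = StructType([
        StructField("foo", FloatType(), False),
        StructField("bar", FloatType(), False),
    ])
    split_udf = udf(halve_and_mod, schema)

    return (
        df.withColumn("foobar", split_udf(source_col))  # attach the struct column
          .select("*", "foobar.*")                      # expand its fields to top level
          .drop("foobar")                               # discard the intermediate struct
    )
```

With the df from the answer, add_foo_bar(df) should yield the columns x, y, foo and bar. Keeping the row logic in a plain function also makes it easy to check outside Spark before wrapping it in a UDF.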

See also: Derive multiple columns from a single column in a Spark DataFrame
