How to return a "Tuple type" in a UDF in PySpark?


Question

All of the data types in pyspark.sql.types:

__all__ = [
    "DataType", "NullType", "StringType", "BinaryType", "BooleanType", "DateType",
    "TimestampType", "DecimalType", "DoubleType", "FloatType", "ByteType", "IntegerType",
    "LongType", "ShortType", "ArrayType", "MapType", "StructField", "StructType"]

I have to write a UDF (in pyspark) which returns an array of tuples. What do I give as the second argument to udf, which is the return type of the UDF? It would be something along the lines of ArrayType(TupleType())...

Answer

There is no such thing as a TupleType in Spark. Product types are represented as structs with fields of specific types. For example, if you want to return an array of pairs (integer, string), you can use a schema like this:

from pyspark.sql.types import *

schema = ArrayType(StructType([
    StructField("char", StringType(), False),
    StructField("count", IntegerType(), False)
]))
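
As a side note (an addition, not part of the original answer): on Spark 2.3 and later, udf also accepts the same return type written as a DDL-formatted string, which is more compact:

# Equivalent to the ArrayType(...) schema above; assumes Spark 2.3+,
# where DDL-formatted strings are accepted as UDF return types
ddl_schema = "array<struct<char:string,count:int>>"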

Example usage:

from pyspark.sql.functions import udf
from collections import Counter

char_count_udf = udf(
    lambda s: Counter(s).most_common(),  # returns a list of (char, count) tuples
    schema
)

# Assumes a PySpark shell where `sc` (SparkContext) is already in scope
df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["id", "value"])

df.select("*", char_count_udf(df["value"])).show(2, False)

## +---+-----+-------------------------+
## |id |value|PythonUDF#<lambda>(value)|
## +---+-----+-------------------------+
## |1  |foo  |[[o,2], [f,1]]           |
## |2  |bar  |[[r,1], [a,1], [b,1]]    |
## +---+-----+-------------------------+
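
Since each element of the returned array is a struct, its fields can be addressed by name. A minimal sketch (an addition, not from the original answer) that flattens the result with the standard explode function:

from pyspark.sql.functions import explode

# Turn each (char, count) struct into its own row, then select the named fields
pairs = df.select("id", explode(char_count_udf(df["value"])).alias("pair"))
pairs.select("id", "pair.char", "pair.count").show()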

