How to create a udf in PySpark which returns an array of strings?
Problem description
I have a udf which returns a list of strings. This should not be too hard. I pass in the datatype when executing the udf since it returns an array of strings: ArrayType(StringType).
Now, somehow this is not working:
The dataframe I'm operating on is df_subsets_concat and looks like this:
df_subsets_concat.show(3,False)
+----------------------+
|col1 |
+----------------------+
|oculunt |
|predistposed |
|incredulous |
+----------------------+
only showing top 3 rows
The code is:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

my_udf = lambda domain: ['s', 'n']
label_udf = udf(my_udf, ArrayType(StringType))
df_subsets_concat_with_md = df_subsets_concat.withColumn('subset', label_udf(df_subsets_concat.col1))
The result is:
/usr/lib/spark/python/pyspark/sql/types.py in __init__(self, elementType, containsNull)
288 False
289 """
--> 290 assert isinstance(elementType, DataType), "elementType should be DataType"
291 self.elementType = elementType
292 self.containsNull = containsNull
AssertionError: elementType should be DataType
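The assertion fires because the class StringType is being passed where an instance of a DataType subclass is required. A minimal pure-Python sketch of the check that PySpark's type constructor performs (these are hypothetical simplified classes, not the real pyspark.sql.types implementation):

```python
class DataType:
    """Stand-in for pyspark.sql.types.DataType."""
    pass

class StringType(DataType):
    """Stand-in for pyspark.sql.types.StringType."""
    pass

class ArrayType(DataType):
    """Stand-in for pyspark.sql.types.ArrayType with the same assertion."""
    def __init__(self, elementType):
        # This mirrors the assert in the traceback above.
        assert isinstance(elementType, DataType), "elementType should be DataType"
        self.elementType = elementType

ArrayType(StringType())   # works: StringType() is a DataType instance

try:
    ArrayType(StringType)  # fails: the class object itself is not an instance
except AssertionError as e:
    print(e)               # prints "elementType should be DataType"
```

The class `StringType` is a type object, not a value of type DataType, so `isinstance(StringType, DataType)` is False and the assertion raises.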
It is my understanding that this was the correct way to do this. Here are some resources: pySpark Data Frames "assert isinstance(dataType, DataType), dataType should be DataType" and How to return a "Tuple type" in a UDF in PySpark?
But neither of these has helped me resolve why this is not working. I am using PySpark 1.6.1.
How do I create a udf in PySpark which returns an array of strings?
Recommended answer
You need to initialize a StringType instance:
label_udf = udf(my_udf, ArrayType(StringType()))
# ^^
df.withColumn('subset', label_udf(df.col1)).show()
+------------+------+
| col1|subset|
+------------+------+
| oculunt|[s, n]|
|predistposed|[s, n]|
| incredulous|[s, n]|
+------------+------+