Apply custom function to cells of selected columns of a data frame in PySpark
Question
Let's say I have a data frame which looks like this:
+---+-----------+-----------+
| id| address1| address2|
+---+-----------+-----------+
| 1|address 1.1|address 1.2|
| 2|address 2.1|address 2.2|
+---+-----------+-----------+
I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
    return intersection_count
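As a quick plain-Python sanity check of this function (with made-up street names, not taken from the question's data):

```python
def example(string1, string2):
    # Count the distinct words the two lowercased strings have in common.
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
    return intersection_count

# "123 main st" and "123 main street" share the tokens "123" and "main".
print(example("123 Main St", "123 main street"))  # → 2
```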
I want to store the result in a new column, so that my final data frame would look like:
+---+-----------+-----------+------+
| id| address1| address2|result|
+---+-----------+-----------+------+
| 1|address 1.1|address 1.2| 2|
| 2|address 2.1|address 2.2| 7|
+---+-----------+-----------+------+
I've tried to execute it the same way I once applied a built-in function to a whole column, but I got an error:
>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable
What am I doing wrong, and how can I apply a custom function to strings in selected columns?
Answer
You have to use a udf (user-defined function) in Spark:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2))