Apply custom function to cells of selected columns of a data frame in PySpark
Question
Let's say I have a data frame which looks like this:
+---+-----------+-----------+
| id| address1| address2|
+---+-----------+-----------+
| 1|address 1.1|address 1.2|
| 2|address 2.1|address 2.2|
+---+-----------+-----------+
I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
    return intersection_count
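As a quick plain-Python sanity check of this function (with made-up street names, not taken from the question's data):

```python
def example(string1, string2):
    # Count the distinct words the two lowercased strings have in common.
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
    return intersection_count

# "123 main st" and "123 main street" share the tokens "123" and "main".
print(example("123 Main St", "123 main street"))  # → 2
```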
I want to store the result in a new column, so that my final data frame would look like:
+---+-----------+-----------+------+
| id| address1| address2|result|
+---+-----------+-----------+------+
| 1|address 1.1|address 1.2| 2|
| 2|address 2.1|address 2.2| 7|
+---+-----------+-----------+------+
I've tried to execute it the same way I once applied a built-in function to a whole column, but I got an error:
>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable
What am I doing wrong, and how can I apply a custom function to strings in selected columns?
Answer
You have to use a udf (user-defined function) in Spark:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2))