Pyspark Dataframe Join using UDF
Question
I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I get is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?
Answer
Spark 2.2+
You have to use crossJoin or enable cross joins in the configuration:
df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
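For context, a minimal self-contained sketch of the Spark 2.2+ approach; isJoin here is a hypothetical stand-in for whatever custom predicate you need, and the commented-out lines show the configuration-flag alternative (spark.sql.crossJoin.enabled):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Hypothetical custom predicate -- replace with your own join logic
def isJoin(a, b):
    return a is not None and b is not None and a.startswith(b)

my_join_udf = udf(isJoin, BooleanType())

df1 = spark.createDataFrame([("foobar",), ("baz",)], ["col_a"])
df2 = spark.createDataFrame([("foo",), ("qux",)], ["col_b"])

# Explicit cross join, then filter with the UDF
my_join_df = df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
my_join_df.show()

# Alternatively, allow implicit cross joins globally and use a plain join:
# spark.conf.set("spark.sql.crossJoin.enabled", "true")
# my_join_df = df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))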
Spark 2.0, 2.1
The method shown below no longer works in Spark 2.x. See SPARK-19728.
Spark 1.x
Theoretically you can join and filter:
df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
but in general you shouldn't do it at all. Any type of join which is not based on equality requires a full Cartesian product (same as the answer), which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).
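If part of the condition can be expressed as an equality on some key, it is usually better to push that equality into a regular join and keep only the non-equi remainder in the UDF filter, since an equi-join avoids the full Cartesian product. A hedged sketch, assuming both frames share a hypothetical key column:

# Assumes df1 and df2 both have a "key" column; only rows with matching keys
# need to be compared by the custom predicate.
joined = df1.join(df2, df1.key == df2.key)                 # equi-join, no Cartesian product
result = joined.where(my_join_udf(df1.col_a, df2.col_b))   # residual custom filter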