Pyspark Dataframe Join using UDF
Question
I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I get is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?
Answer
Spark 2.2+
df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))
Spark 2.0, 2.1
The method shown below no longer works in Spark 2.x. See SPARK-19728.
Spark 1.x
Theoretically you can join and filter:
df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))
but in general you shouldn't do it at all. Any type of join which is not based on equality requires a full Cartesian product (same as the answer above), which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).
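To see why a non-equality condition forces a Cartesian product, here is the same idea sketched in plain Python (hypothetical toy data; Spark faces the same pair count, just distributed): an equi-join can hash on the key, but an arbitrary predicate can only be evaluated by examining every pair.

```python
from itertools import product

left = [1, 2, 3]
right = [2, 4]

# With a non-equality predicate, every (left, right) pair must be checked:
# len(left) * len(right) candidates, i.e. the full Cartesian product.
candidate_pairs = list(product(left, right))
assert len(candidate_pairs) == len(left) * len(right)

# The custom condition then filters the pairs, exactly like the UDF does.
matches = [(x, y) for x, y in candidate_pairs if abs(x - y) <= 1]
```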