Pyspark crossjoin between 2 dataframes with millions of records
Problem Description
I have 2 dataframes, A (35 million records) and B (30,000 records).
A
|Text |
-------
| pqr |
-------
| xyz |
-------
B
|Title |
-------
| a |
-------
| b |
-------
| c |
-------
Dataframe C below is obtained after a crossjoin between A and B. Note that crossJoin takes no join condition (the original snippet passed an on= argument and had an unclosed bracket); the output shown is the full Cartesian product:

c = A.crossJoin(B)
C
|text | Title |
---------------
| pqr | a |
---------------
| pqr | b |
---------------
| pqr | c |
---------------
| xyz | a |
---------------
| xyz | b |
---------------
| xyz | c |
---------------
Both columns above are of type String.
I am performing the operation below, and it results in a Spark error (Job aborted due to stage failure):

from pyspark.sql.functions import col, when

display(c.withColumn("Contains", when(col('text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())
Any suggestions on how this join should be done to avoid the Spark error on the resulting operations?
Answer
Try using a broadcast join. Broadcast the small dataframe B (30,000 records), not the 35-million-record A, so each executor gets a local copy of B and the cross join avoids a full shuffle (the original snippet broadcast A and called broadcast through an undefined functions alias):

from pyspark.sql.functions import broadcast
c = A.crossJoin(broadcast(B))
If you don't need the extra "Contains" column, you can just filter directly. Your original code kept the rows where Contains == 0 (i.e. text does not contain Title), so the filter must be negated with ~:

display(c.filter(~col("text").contains(col("Title"))).distinct())