Pyspark crossjoin between 2 dataframes with millions of records
Problem Description
I have 2 dataframes, A (35 million records) and B (30,000 records).
A
|Text |
-------
| pqr |
-------
| xyz |
-------
B
|Title |
-------
| a |
-------
| b |
-------
| c |
-------
Dataframe C below is obtained after a crossjoin between A and B. Note that crossJoin takes no join condition (the original snippet passed an on= argument and had an unclosed bracket); the output shown is the full Cartesian product:

c = A.crossJoin(B)
C
|text | Title |
---------------
| pqr | a |
---------------
| pqr | b |
---------------
| pqr | c |
---------------
| xyz | a |
---------------
| xyz | b |
---------------
| xyz | c |
---------------
Both columns above are of type String.
I am performing the operation below, and it results in a Spark error (Job aborted due to stage failure):

from pyspark.sql.functions import col, when

display(c.withColumn("Contains", when(col('text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())
Any suggestions on how this join should be done to avoid the Spark error on the resulting operations?
Answer
Try using a broadcast join. Broadcast the small dataframe B (30,000 records), not the 35-million-record A, so each executor gets a local copy of B and the cross join avoids a full shuffle (the original snippet broadcast A and called broadcast through an undefined functions alias):

from pyspark.sql.functions import broadcast
c = A.crossJoin(broadcast(B))
If you don't need the extra "Contains" column, you can just filter directly. Your original code kept the rows where Contains == 0 (i.e. text does not contain Title), so the filter must be negated with ~:

display(c.filter(~col("text").contains(col("Title"))).distinct())