Pyspark crossjoin between 2 dataframes with millions of records


Problem description

I have 2 dataframes, A (35 million records) and B (30,000 records):

A

| Text |
--------
| pqr  |
| xyz  |

B

| Title |
---------
| a     |
| b     |
| c     |

Dataframe C below is obtained after a cross join between A and B:

c = A.crossJoin(B)

C

| Text | Title |
----------------
| pqr  | a     |
| pqr  | b     |
| pqr  | c     |
| xyz  | a     |
| xyz  | b     |
| xyz  | c     |

Both of the columns above are of type String.

I am performing the operation below, and it results in a Spark error (Job aborted due to stage failure):

display(c.withColumn("Contains", when(col('Text').contains(col('Title')), 1).otherwise(0)).filter(col('Contains') == 0).distinct())

Any suggestions on how this join should be done so that the resulting operations avoid the Spark error?

[Spark error message screenshot]

Recommended answer

Try using a broadcast join. Broadcasting the small dataframe (B, ~30,000 rows) ships a copy of it to every executor, so the cross join can run without shuffling the 35-million-row A:

from pyspark.sql.functions import broadcast

# Broadcast the small dataframe (B), not the 35M-row A, so each executor
# holds a local copy of B and no shuffle is needed for the cross join.
c = A.crossJoin(broadcast(B))
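To check that the hint took effect, you can inspect the physical plan; with the small side broadcast, the cross join should compile to a BroadcastNestedLoopJoin rather than a shuffle-based CartesianProduct (a minimal sketch, assuming the c defined above):

# Print the physical plan; look for BroadcastNestedLoopJoin (BuildRight)
# instead of CartesianProduct.
c.explain()

Note that an explicit broadcast() hint applies regardless of spark.sql.autoBroadcastJoinThreshold, but the broadcast side still has to fit in memory on the driver and on each executor.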

If you don't need the extra "Contains" column, you can just filter directly (the ~ negation reproduces the Contains == 0 condition from the question):

display(c.filter(~col("Text").contains(col("Title"))).distinct())
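Putting it together, here is a minimal, self-contained sketch of the whole pipeline on toy data (the sample values are illustrative, and show() stands in for Databricks' display()):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("crossjoin-contains").getOrCreate()

# Toy stand-ins for the real 35M-row A and 30k-row B.
A = spark.createDataFrame([("pqr",), ("xyz",)], ["Text"])
B = spark.createDataFrame([("a",), ("b",), ("c",)], ["Title"])

# Cross join with the small side broadcast to every executor.
c = A.crossJoin(broadcast(B))

# Keep rows where Text does NOT contain Title (the Contains == 0 case),
# then deduplicate.
result = c.filter(~col("Text").contains(col("Title"))).distinct()
result.show()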

