Why does Spark crossJoin take so long for a tiny dataframe?
Problem description
I'm trying to do the following crossJoin on two dataframes with 5 rows each, but Spark spawns 40000 tasks on my machine and takes 30 seconds to complete. Any idea why that is happening?
df = spark.createDataFrame([['1','1'],['2','2'],['3','3'],['4','4'],['5','5']]).toDF('a','b')
df = df.repartition(1)
df.select('a').distinct().crossJoin(df.select('b').distinct()).count()
Recommended answer
Calling .distinct before the join requires a shuffle, so Spark repartitions the data based on the spark.sql.shuffle.partitions property, which defaults to 200. Thus df.select('a').distinct() and df.select('b').distinct() each produce a new DataFrame with 200 partitions, and crossing them yields 200 x 200 = 40000 tasks.