Spark repartition is slow and shuffles too much data


Problem description

My cluster:

  • 5 data nodes
  • Each data node has: 8 CPUs, 45 GB of memory

Due to some other configuration limit, I can only start 5 executors on each data node. So I did:

spark-submit --num-executors 30 --executor-memory 2G ...

So each executor uses 1 core.
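For reference, here is a minimal sketch (not from the question) of roughly equivalent settings expressed through SparkConf on YARN; the app name is made up, and spark.executor.cores is written out explicitly, whereas the spark-submit line above relies on the YARN default of 1 core per executor:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical programmatic equivalent of the spark-submit flags above (YARN assumed).
val conf = new SparkConf()
  .setAppName("repartition-example")       // made-up app name
  .set("spark.executor.instances", "30")   // --num-executors 30
  .set("spark.executor.memory", "2g")      // --executor-memory 2G
  .set("spark.executor.cores", "1")        // 1 core per executor, stated explicitly

val sc = new SparkContext(conf)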

I have two data sets, each about 20 GB. In my code, I did:

val rdd1 = sc...cache()
val rdd2 = sc...cache()

val x = rdd1.cartesian(rdd2).repartition(30) map ...

In the Spark UI, I saw that the repartition step took more than 30 minutes and caused a data shuffle of more than 150 GB.

I don't think that is right, but I could not figure out what went wrong...

Recommended answer

Did you really mean "cartesian"?

You are multiplying every row in RDD1 by every row in RDD2. So if your rows were, say, 1 MB each, each RDD held about 20,000 rows. The cartesian product then returns a set with 20,000 x 20,000, or 400 million, records. And note that each output row is now double in width (about 2 MB), so RDD3 would hold on the order of 800 TB, whereas RDD1 and RDD2 only held 20 GB each.
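To make the blow-up concrete, here is a tiny runnable sketch with made-up data (not the asker's RDDs): cartesian emits one output record for every pair of input rows, so the output row count is the product of the two input counts, and each output row carries the fields of both inputs.

// Toy example: 3 rows x 4 rows -> 12 output pairs.
val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq("x", "y", "z", "w"))
val prod = a.cartesian(b)                  // every element of a paired with every element of b
println(prod.count())                      // 12 = 3 * 4
println(prod.collect().mkString(", "))     // all 12 combinations of a and b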

Maybe try:

val x = rdd1.union(rdd2).repartition(30) map ...

Or even:

val x = rdd1.zip(rdd2).repartition(30) map ...

?
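For what it's worth, here is a small sketch with toy data contrasting the two suggestions: union simply concatenates the two RDDs (no pairing of rows), while zip pairs elements positionally and requires both RDDs to have the same number of partitions and the same number of elements per partition. Which one applies depends on what the map step actually needs.

val a = sc.parallelize(Seq(1, 2, 3, 4), 2)
val b = sc.parallelize(Seq(10, 20, 30, 40), 2)

println(a.union(b).count())                 // 8: all rows of a followed by all rows of b
println(a.zip(b).collect().mkString(", "))  // (1,10), (2,20), (3,30), (4,40)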

