How to repartition evenly in Spark?


Question

To test how .repartition() works, I ran the following code:

rdd = sc.parallelize(range(100))
rdd.getNumPartitions()

rdd.getNumPartitions() returned 4. I then ran:

rdd = rdd.repartition(10)
rdd.getNumPartitions()

This time rdd.getNumPartitions() returned 10, so there were now 10 partitions.

However, when I checked the partitions with:

rdd.glom().collect()

the result gave 4 non-empty lists and 6 empty lists. Why haven't any elements been distributed into the other 6 lists?

Answer

The algorithm behind repartition() optimizes how data is redistributed across partitions. Here the range is very small, so Spark does not find it worthwhile to actually break the data down further. If you were to use a much bigger range, such as 100000, you would find that it does in fact redistribute the data.
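As a rough check of that claim, a minimal sketch along the same lines as the question (assuming the same SparkContext sc; the exact counts will vary) could look like this:

    rdd = sc.parallelize(range(100000))
    rdd = rdd.repartition(10)
    # count how many elements landed in each of the 10 partitions
    print([len(p) for p in rdd.glom().collect()])

With a range this large, each of the 10 partitions should report a non-zero count.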

If you want to force a certain number of partitions, you can specify the number of partitions when the data is initially loaded. At that point Spark will try to distribute the data evenly across the partitions, even if that is not necessarily optimal. The parallelize function takes a second argument for the number of partitions:

    rdd = sc.parallelize(range(100), 10)
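Repeating the question's glom() check on this RDD should now show the 100 elements split evenly, 10 per partition:

    [len(p) for p in rdd.glom().collect()]
    # -> [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]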

The same thing works if you are, say, reading from a text file:

    rdd = sc.textFile('path/to/file', numPartitions)
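Note that in PySpark the second parameter to textFile() is minPartitions: Spark treats it as a lower bound rather than an exact count, so the resulting RDD can end up with more partitions than requested.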
