repartition() is not affecting the RDD's number of partitions

Question

I am trying to change the number of partitions of an RDD using the repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition count using the RDD's partitions.size property, I get back the same number of partitions it originally had:

scala> rdd.partitions.size
res56: Int = 50

scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27

At this stage I perform an action such as rdd.take(1) just to force evaluation, in case that matters. Then I check the partition count again:

scala> rdd.partitions.size
res58: Int = 50

As you can see, it has not changed. Can someone explain why?

Recommended answer

First, repartition is indeed lazy: it is a transformation, so no data is moved until an action runs on the RDD it returns. Second, and this is what you are seeing, repartition does not modify the RDD in place. It returns a new RDD with the partitioning changed, so you must use the returned RDD, or else you are still working off the old partitioning. Calling rdd.partitions.size or rdd.take(1) on the original rdd only ever touches the original 50 partitions. Finally, when reducing the number of partitions, you should prefer coalesce, as by default it does not do a full shuffle of the data. Instead, it merges existing partitions into fewer ones, keeping data where it already is when it can and pulling in only the leftover partitions.
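To make the point about using the returned RDD concrete, here is a minimal sketch. It assumes a local Spark setup; the sample data, the app name, and the names repartitioned and coalesced are made up for illustration and are not from the question. Inside spark-shell, SparkContext.getOrCreate should simply hand back the existing sc.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local master, for illustration only.
// In spark-shell, getOrCreate returns the already running context.
val conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data in 50 partitions, mirroring the question.
val rdd = sc.parallelize((1 to 1000).map(_.toString), numSlices = 50)

// repartition returns a NEW RDD; `rdd` itself is left unchanged.
val repartitioned = rdd.repartition(10)

println(rdd.partitions.size)           // still 50: the original RDD
println(repartitioned.partitions.size) // 10: the returned RDD

// coalesce also returns a new RDD; with the default shuffle = false
// it merges existing partitions instead of doing a full shuffle.
val coalesced = rdd.coalesce(10)
println(coalesced.partitions.size)

// In a standalone script, call sc.stop() once you are done.
```

In the original session, the equivalent fix would be to capture the result, for example val rdd2 = rdd.repartition(10), and then inspect rdd2.partitions.size rather than rdd.partitions.size.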
