Remove Empty Partitions from Spark RDD


Problem Description

I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the number of partitions based on the number of HDFS blocks. This leads to a large number of empty partitions which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.

Is there any other way to go about this?

Solution

There isn't a simple way to delete just the empty partitions from an RDD.
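
You can, however, measure how bad the problem is. Here is a minimal sketch, assuming a spark-shell session (so sc is already bound); the 50-partition RDD is fabricated to mimic the situation in the question, since filter preserves the partition layout:

// 50 partitions, but only the first 10 keep any data after the filter,
// leaving 40 blank partitions.
val rdd = sc.parallelize(1 to 1000, 50).filter(_ <= 200)

// One (index, recordCount) pair per partition; empties show up as 0.
val sizes = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

println(s"empty partitions: ${sizes.count(_._2 == 0)} of ${rdd.getNumPartitions}")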

coalesce doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).
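
You can verify this with the same fabricated RDD. A sketch, again assuming a spark-shell session; countEmpty is a hypothetical helper defined here, not a Spark API:

// Counts the partitions that contain no records.
def countEmpty(r: org.apache.spark.rdd.RDD[Int]): Long =
  r.mapPartitions(it => if (it.isEmpty) Iterator(1) else Iterator.empty).count()

// 40 blank partitions plus 10 with data, as in the example above.
val rdd = sc.parallelize(1 to 1000, 50).filter(_ <= 200)

// coalesce merges neighbouring partitions without a shuffle, so a
// 50 -> 45 coalesce can eliminate at most 5 of the 40 blanks.
println(countEmpty(rdd.coalesce(45)))  // prints at least 35

This is because coalesce only merges existing partitions; it never shuffles records around to fill the blank ones.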

The repartition method splits the data evenly over all the partitions, so there won't be any empty partitions. If you have an RDD with 50 blank partitions and 10 partitions with data and run rdd.repartition(20), the data will be evenly split across the 20 partitions.
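
The same check with this example's numbers, again a sketch assuming the spark-shell session and the countEmpty helper above:

// 50 blank partitions plus 10 with data.
val rdd = sc.parallelize(1 to 1200, 60).filter(_ <= 200)

// repartition always shuffles and spreads records evenly, so every
// one of the 20 resulting partitions receives data.
println(countEmpty(rdd.repartition(20)))  // prints 0

The trade-off is that repartition always performs a full shuffle, which coalesce avoids. If the goal is only to get rid of the empty partitions, repartitioning down to the number of non-empty partitions is a common workaround.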
