Remove Empty Partitions from Spark RDD
Problem Description
I am fetching data from HDFS and storing it in a Spark RDD. Spark creates the number of partitions based on the number of HDFS blocks. This leads to a large number of empty partitions which also get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but there is no guarantee that all the empty partitions will be removed.
Is there any other way to go about this?
There isn't an easy way to simply delete the empty partitions from an RDD.
coalesce doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).
The repartition method splits the data evenly over all the partitions, so there won't be any empty partitions. If you have an RDD with 50 blank partitions and 10 partitions with data and run rdd.repartition(20), the data will be evenly split across the 20 partitions.