Spark:从 RDD 检索大数据到本地机器的最佳实践 [英] Spark: Best practice for retrieving big data from RDD to local machine
问题描述
我在纱线集群中有很大的 RDD(1gb).在使用这个集群的本地机器上,我只有 512 mb.我想在本地机器上迭代 RDD 中的值.我不能使用 collect(),因为它会在本地创建太大的数组,而不是我的堆.我需要一些迭代的方式.有方法 iterator(),但它需要一些额外的信息,我无法提供.
I've got big RDD(1gb) in yarn cluster. On local machine, which use this cluster I have only 512 mb. I'd like to iterate over values in RDD on my local machine. I can't use collect(), because it would create too big array locally which more then my heap. I need some iterative way. There is method iterator(), but it requires some additional information, I can't provide.
UDP:提交给LocalIterator方法
推荐答案
更新: RDD.toLocalIterator
方法在写完原答案后出现,是一种更高效的方法完成工作的方式.它使用 runJob
来评估每一步的单个分区.
Update: RDD.toLocalIterator
method that appeared after the original answer has been written is a more efficient way to do the job. It uses runJob
to evaluate only a single partition on each step.
TL;DR 原始答案可能会粗略地说明它是如何工作的:
TL;DR And the original answer might give a rough idea how it works:
首先获取分区索引数组:
First of all, get the array of partition indexes:
val parts = rdd.partitions
然后创建更小的 rdds,过滤掉除单个分区之外的所有内容.从较小的 rdd 收集数据并迭代单个分区的值:
Then create smaller rdds filtering out everything but a single partition. Collect the data from smaller rdds and iterate over values of a single partition:
for (p <- parts) {
val idx = p.index
val partRdd = rdd.mapPartitionsWithIndex(a => if (a._1 == idx) a._2 else Iterator(), true)
//The second argument is true to avoid rdd reshuffling
val data = partRdd.collect //data contains all values from a single partition
//in the form of array
//Now you can do with the data whatever you want: iterate, save to a file, etc.
}
我没有尝试这个代码,但它应该可以工作.如果它不会编译,请写评论.当然,它只有在分区足够小时才有效.如果不是,您可以随时使用 rdd.coalesce(numParts, true)
增加分区数.
I didn't try this code, but it should work. Please write a comment if it won't compile. Of cause, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true)
.
这篇关于Spark:从 RDD 检索大数据到本地机器的最佳实践的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!