Spark: Best practice for retrieving big data from RDD to local machine

Question

I've got a big RDD (1 GB) in a YARN cluster. On the local machine which uses this cluster I have only 512 MB. I'd like to iterate over the values of the RDD on my local machine. I can't use collect(), because it would create an array locally that is bigger than my heap. I need some iterative approach. There is the method iterator(), but it requires some additional information that I can't provide.

UPD: resolved by the toLocalIterator method.

Answer

Update: the RDD.toLocalIterator method, which appeared after the original answer was written, is a more efficient way to do the job. It uses runJob to evaluate only a single partition on each step.
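For illustration, here is a minimal sketch of that approach (assuming an existing SparkContext named sc; the input path and the per-value handling are hypothetical):

val rdd = sc.textFile("hdfs:///path/to/data")  // hypothetical input

// toLocalIterator hands back values partition by partition, so the driver only
// needs enough memory to hold a single partition at a time.
rdd.toLocalIterator.foreach { value =>
  // process each value locally, e.g. append it to a local file
  println(value)
}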

TL;DR The original answer below might give a rough idea of how it works:

First of all, get the array of partition indexes:

val parts = rdd.partitions

Then create smaller RDDs by filtering out everything but a single partition. Collect the data from each smaller RDD and iterate over the values of that single partition:

for (p <- parts) {
    val idx = p.index
    // Keep only the partition with index idx; preservesPartitioning = true avoids a reshuffle.
    val partRdd = rdd.mapPartitionsWithIndex(
      (index, it) => if (index == idx) it else Iterator.empty,
      preservesPartitioning = true)
    // data contains all values from a single partition in the form of an array.
    val data = partRdd.collect()
    // Now you can do whatever you want with the data: iterate, save to a file, etc.
}

I didn't try this code, but it should work. Please write a comment if it doesn't compile. Of course, it will only work if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true).
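As a rough sketch (the target partition count is arbitrary here; shuffle = true is needed because coalesce without a shuffle can only decrease the number of partitions):

val numParts = 64  // assumed: enough partitions that each one fits into local memory
val finerRdd = rdd.coalesce(numParts, shuffle = true)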
