How to overwrite an RDD in a loop


Problem description

I am very new to Spark and Scala, and I am implementing an iterative algorithm that manipulates a big graph. Assume that inside a for loop we have two RDDs (rdd1 and rdd2) and their values get updated, for example something like:

for (i <- 0 to 5) {
   val rdd1 = rdd2.someTransformations
   rdd2 = rdd1
}

So basically, during iteration i+1 the value of rdd1 is computed based on its value at iteration i. I know that RDDs are immutable, so I cannot really reassign anything to them, but I just wanted to know whether what I have in mind is possible to implement or not. If so, how? Any help is greatly appreciated.

Thanks

Updated: when I try this code:

var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))

for(i <- 0 to 5){
    var size2 = size2.map(y=> readyForExpandFunc(y))
}
size2.collect()

it gives me this error: "recursive variable size2 needs type". I am not sure what it means.

Answer

Just open a spark-shell and try it:

scala> var rdd1 = sc.parallelize(List(1,2,3,4,5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> for( i <- 0 to 5 ) { rdd1 = rdd1.map( _ + 1 ) }

scala> rdd1.collect()
res1: Array[Int] = Array(7, 8, 9, 10, 11)                                       

As you can see, it works: because rdd1 is declared as a var, each pass through the loop can reassign it to the new RDD produced by the transformation.
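The same pattern also fixes the updated code from the question. The "recursive variable size2 needs type" error is a Scala compiler error, not a Spark one: the inner var size2 = size2.map(...) declares a brand-new variable whose initializer refers to the variable being declared, so the compiler cannot infer its type. Declaring size2 once and only reassigning it inside the loop avoids this. A sketch reusing the asker's own names (freqSubGraphs, groupedNeighbours, extendFunc and readyForExpandFunc are assumed to be defined elsewhere in their code):

// declare size2 once, as a var, so it can be reassigned later
var size2 = freqSubGraphs.join(groupedNeighbours).map(y => extendFunc(y))

for (i <- 0 to 5) {
    // plain reassignment, not a new `var` declaration:
    // this is what removes "recursive variable size2 needs type"
    size2 = size2.map(y => readyForExpandFunc(y))
}
size2.collect()

Note that each iteration only adds another map step to size2's lineage; nothing is actually computed until collect() is called.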

