Update the internal state of RDD elements
Question
I'm new to Spark and I want to update the internal state of my RDD's elements with the rdd.foreach method, but it doesn't work. Here is my code example:
class Test extends Serializable {
  var foo = 0.0
  var bar = 0.0

  // Mutates this instance in place.
  def updateFooBar(): Unit = {
    foo = Math.random()
    bar = Math.random()
  }
}

val testList = Array.fill(5)(new Test())
val testRDD = sc.parallelize(testList)

// Attempt to update every element in place.
testRDD.foreach { x => x.updateFooBar() }
testRDD.collect().foreach { x => println(x.foo + "~" + x.bar) }
The result is:
0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0
Recommended answer
RDDs are immutable by design. This design choice makes them more robust, as mutation is a common source of bugs, and it supports the "resilient" part of the RDD name (resilient distributed dataset); if a partition in a downstream RDD is lost, Spark can reconstruct it from its parents. So, it's best to think of Spark programming as construction of dataflows, even when you're not doing streaming.
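In dataflow terms, the usual alternative to mutating elements is to derive a new RDD with map. The sketch below is one possible way to do that, not the only one. It assumes a local master, replaces the question's mutable class with an immutable case class of the same name, and introduces a hypothetical helper called refresh. It is also worth noting why the original code printed zeros: as far as I understand, foreach runs on deserialized copies of the elements inside tasks, so the mutations never reach the objects that collect() later returns.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Immutable counterpart of the question's Test class (name reused for illustration).
case class Test(foo: Double = 0.0, bar: Double = 0.0)

object UpdateRddElements {
  // Hypothetical helper: derives a new RDD whose elements carry fresh values.
  // The input RDD is left unchanged.
  def refresh(rdd: RDD[Test]): RDD[Test] =
    rdd.map(_ => Test(Math.random(), Math.random()))

  def main(args: Array[String]): Unit = {
    // Assumption: local master for a quick test; spark-shell already provides `sc`.
    val conf = new SparkConf().setAppName("update-rdd-elements").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val testRDD = sc.parallelize(Seq.fill(5)(Test()))
      // cache() asks Spark to keep the computed elements, so repeated actions
      // normally see the same random values instead of recomputing them.
      val updatedRDD = refresh(testRDD).cache()
      updatedRDD.collect().foreach(t => println(s"${t.foo}~${t.bar}"))
    } finally {
      sc.stop()
    }
  }
}
```

Caching is best effort: if a cached partition is evicted or lost, Spark recomputes it and Math.random() yields different values. When the values must be stable, a deterministic function of the input (or a seeded generator) is the safer choice.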
As for foreach, it's designed for "pure side effect" operations, like writing to disk, a database, or the console.
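To illustrate the kind of work foreach suits, here is a minimal sketch with two side effects: printing each element, and counting elements through an accumulator, which is Spark's supported channel for sending simple aggregates from tasks back to the driver. The object name, the countAbove helper, the sample values, and the local master are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ForeachSideEffects {
  // Hypothetical helper: counts elements above a threshold via an accumulator.
  def countAbove(sc: SparkContext, rdd: RDD[Double], threshold: Double): Long = {
    val above = sc.longAccumulator("aboveThreshold")
    // Side effect only: each task adds to the accumulator; nothing is returned.
    rdd.foreach(x => if (x > threshold) above.add(1))
    above.value
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master for a quick test; spark-shell already provides `sc`.
    val conf = new SparkConf().setAppName("foreach-side-effects").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(Seq(0.1, 0.7, 0.4, 0.9))
      // Console output happens wherever the task runs (executor logs on a cluster).
      numbers.foreach(x => println(s"value: $x"))
      println(s"values above 0.5: ${countAbove(sc, numbers, 0.5)}")
    } finally {
      sc.stop()
    }
  }
}
```

In both cases the RDD itself is unchanged after foreach returns; only the external effect (console output, accumulator value) remains.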