Update the internal state of RDD elements

Question

I'm new to Spark and I want to update the internal state of my RDD's elements with the rdd.foreach method, but it doesn't work. Here is my code example:

class Test extends Serializable{
  var foo = 0.0
  var bar = 0.0

  def updateFooBar() = {
    foo = Math.random()
    bar = Math.random()
  }
}

var testList = Array.fill(5)(new Test())
var testRDD = sc.parallelize(testList)
testRDD.foreach{ x => x.updateFooBar() }
testRDD.collect().foreach { x=> println(x.foo+"~"+x.bar) }

The result is:

0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0
0.0~0.0

Recommended answer

RDDs are immutable by design. This design choice makes them more robust, as mutation is a common source of bugs, and it supports the "resilient" part of the RDD name (resilient distributed dataset); if a partition in a downstream RDD is lost, Spark can reconstruct it from its parents. So, it's best to think of Spark programming as construction of dataflows, even when you're not doing streaming.
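To make the dataflow idea concrete, here is a minimal sketch of the same task written as a transformation rather than a mutation. The names TestState and MapInsteadOfMutate are hypothetical (they are not in the question), and the local master is an assumption made only so the snippet runs standalone. Instead of changing elements in place, map derives a new RDD whose elements carry the new values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical immutable stand-in for the question's Test class.
case class TestState(foo: Double, bar: Double)

object MapInsteadOfMutate {
  // Builds five zeroed elements, then derives a NEW RDD with random values.
  def run(sc: SparkContext): Array[TestState] = {
    val initial = sc.parallelize(Seq.fill(5)(TestState(0.0, 0.0)))
    // map does not touch `initial`; it describes a new dataset.
    val updated = initial.map(_ => TestState(Math.random(), Math.random()))
    updated.collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, chosen only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("map-instead-of-mutate").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try run(sc).foreach(s => println(s"${s.foo}~${s.bar}"))
    finally sc.stop()
  }
}
```

Because the function passed to map is random, each action on the derived RDD may recompute different values; if the same values are needed across several actions, persisting the RDD or collecting it once is probably the safer choice.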

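As for foreach, it is designed for "pure side effect" operations, such as writing to disk, a database, or the console. In your example, each task calls updateFooBar() on deserialized copies of the Test objects, and collect() then reads the elements again from the original array on the driver, so the updates never show up in the output.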
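For contrast, the sketch below shows foreach used only for a side effect whose result is visible on the driver. It relies on an accumulator, which the original answer does not mention, so treat it as one possible illustration rather than the answer's own method. The names ForeachSideEffect and countAbove are hypothetical, and local mode is again an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachSideEffect {
  // Counts values above a threshold via a side effect inside foreach.
  def countAbove(sc: SparkContext, values: Seq[Double], threshold: Double): Long = {
    // Accumulators are Spark's mechanism for sending updates back to the driver.
    val above = sc.longAccumulator("aboveThreshold")
    sc.parallelize(values).foreach { v =>
      if (v > threshold) above.add(1L)
    }
    above.value
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, chosen only to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("foreach-side-effect").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countAbove(sc, Seq(0.1, 0.6, 0.9), 0.5))
    finally sc.stop()
  }
}
```

The elements themselves are never modified here; foreach only reports something about them, which is the kind of use it fits.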
