Will there be any scenario where Spark RDDs fail to satisfy immutability?


Question

Spark RDDs are constructed in an immutable, fault-tolerant and resilient manner.

Do RDDs satisfy immutability in all scenarios? Or is there any case, be it in Streaming or Core, where an RDD might fail to satisfy immutability?

Answer

It depends on what you mean when you talk about an RDD. Strictly speaking, an RDD is just a description of lineage which exists only on the driver, and it doesn't provide any methods that could be used to mutate that lineage.

Once data is being processed we are no longer talking about RDDs but about tasks; nevertheless, the data is exposed through immutable data structures (scala.collection.Iterator in Scala, itertools.chain in Python).
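The distinction matters: an immutable iterator controls how you traverse the data, not what the yielded elements are. A minimal plain-Python sketch (no Spark involved; the nested lists stand in for partitions of mutable records) shows that itertools.chain offers no mutation API, yet the elements it yields can still be mutated:

```python
from itertools import chain

# Two "partitions", each holding one mutable record.
parts = [[[0]], [[0]]]

# The chain iterator itself is effectively immutable: it only yields elements.
it = chain(*parts)

first = next(it)
first[0] = 42                 # mutating the yielded element, not the iterator

# The change is visible in the backing data the iterator was built from.
assert parts[0][0] == [42]
```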

So far so good. Unfortunately, immutability of a data structure doesn't imply immutability of the stored data. Let's create a small example to illustrate that:

val rdd = sc.parallelize(Array(0) :: Array(0) :: Array(0) :: Nil)
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0

You can execute this as many times as you like and get the same result every time. Now let's cache the rdd and repeat the whole process:

rdd.cache
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 3.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 6.0
rdd.map(a => { a(0) += 1; a.head }).sum
// Double = 9.0

Since the function we use in the first map is not pure and modifies its mutable argument in place, these changes accumulate with each execution and result in unpredictable output. For example, if rdd is evicted from the cache we can once again get 3.0. If only some partitions are cached, you can get mixed results.
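The same effect can be reproduced without Spark at all. In this hedged plain-Python sketch (illustrative names, not Spark API), the "uncached" path rebuilds the data on every pass, like recomputing an RDD from its lineage, while the "cached" path reuses the same mutable objects, like a cached partition:

```python
def make_data():
    # Fresh data on every call, analogous to recomputing from lineage.
    return [[0], [0], [0]]

def impure_map(a):
    a[0] += 1              # in-place mutation, like a(0) += 1 in the Scala example
    return a[0]

# Uncached: each run recomputes the input, so the result is stable.
assert sum(impure_map(a) for a in make_data()) == 3
assert sum(impure_map(a) for a in make_data()) == 3

# Cached: the same objects are reused, so mutations accumulate across runs.
cached = make_data()
assert sum(impure_map(a) for a in cached) == 3
assert sum(impure_map(a) for a in cached) == 6
assert sum(impure_map(a) for a in cached) == 9
```

This also makes the "mixed results" remark concrete: if some elements came from `make_data()` (recomputed) and others from `cached` (reused), the sum would land somewhere in between.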

PySpark provides stronger isolation, so obtaining a result like this is not possible, but that is a matter of architecture, not of immutability.

The takeaway message here is that you should be extremely careful when working with mutable data, and avoid any in-place modifications unless they are explicitly allowed (fold, aggregate).
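The safe alternative is simply to build a new value instead of mutating the cached one. A small sketch, reusing the illustrative names from above (not Spark API):

```python
def pure_map(a):
    # No mutation: read the element and return a new value,
    # leaving the (possibly cached) input untouched.
    return a[0] + 1

cached = [[0], [0], [0]]
assert sum(pure_map(a) for a in cached) == 3
assert sum(pure_map(a) for a in cached) == 3   # stable across repeated runs
assert cached == [[0], [0], [0]]               # backing data unchanged
```

Operations like fold and aggregate are the exception because they hand each task its own fresh accumulator (built from the zero value), so mutating that accumulator in place never touches shared or cached data.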
