Why is the fold action necessary in Spark?

Question

I have a silly question involving fold and reduce in Pyspark. I understand the difference between these two methods, but if both require the applied function to be a commutative monoid, I cannot think of an example in which fold cannot be replaced by reduce.

Besides, the Pyspark implementation of fold uses acc = op(obj, acc). Why is this operation order used instead of acc = op(acc, obj)? (The second order sounds closer to a leftFold to me.)

Cheers,

Tomas

Recommended answer

Empty RDD

fold cannot be replaced by reduce when the RDD is empty:

val rdd = sc.emptyRDD[Int]
rdd.reduce(_ + _)
// java.lang.UnsupportedOperationException: empty collection at   
// org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$ ...

rdd.fold(0)(_ + _)
// Int = 0

You can of course combine reduce with a check on isEmpty, but that is rather ugly.
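Since the question is about Pyspark, here is a minimal Python sketch of the same point. It does not use Spark: plain lists of lists stand in for an RDD's partitions, and `fold_partitions` and `reduce_or_default` are hypothetical helpers written for illustration, not Spark APIs.

```python
from functools import reduce
from operator import add


def fold_partitions(partitions, zero, op):
    """Hypothetical stand-in for RDD.fold: fold every partition
    starting from zero, then fold the partial results."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


def reduce_or_default(partitions, op, default):
    """The 'reduce plus emptiness check' workaround."""
    items = [x for part in partitions for x in part]
    if not items:
        return default
    return reduce(op, items)


empty = [[], [], []]  # three empty "partitions"

# A plain reduce has no value to return for an empty input.
try:
    reduce(add, [x for part in empty for x in part])
    reduce_failed = False
except TypeError:
    reduce_failed = True

# fold simply returns the zero element.
empty_fold = fold_partitions(empty, 0, add)
```

The guarded version works, but the caller has to supply the default value anyway, which is exactly the role the zero element plays in fold.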

Mutable buffer

Another use case for fold is aggregation with a mutable buffer. Consider the following RDD:

import breeze.linalg.DenseVector

val rdd = sc.parallelize(Array.fill(100)(DenseVector(1)), 8)

Let's say we want the sum of all elements. A naive solution is to simply reduce with +:

rdd.reduce(_ + _)

Unfortunately, this creates a new vector for each element. Since object creation and subsequent garbage collection are expensive, it can be better to use a mutable object. That is not safe with reduce, because the accumulator would be one of the RDD's own elements (immutability of an RDD doesn't imply immutability of its elements), but it can be achieved with fold as follows:

rdd.fold(DenseVector(0))((acc, x) => acc += x)

The zero element is used here as a mutable buffer, initialized once per partition, leaving the actual data untouched.
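The same idea can be sketched in plain Python, without Spark or Breeze. This is an illustration under stated assumptions: one-element lists stand in for the vectors, `add_in_place` and `fold_partitions` are hypothetical helpers, and the `copy.deepcopy` call models the assumption that each partition works on its own copy of the zero element.

```python
import copy
from functools import reduce


def add_in_place(acc, x):
    """Add x into acc element-wise, mutating and returning acc."""
    for i, v in enumerate(x):
        acc[i] += v
    return acc


def fold_partitions(partitions, zero, op):
    """Hypothetical stand-in for RDD.fold where every partition
    gets its own copy of the zero element as a mutable buffer."""
    partials = [reduce(op, part, copy.deepcopy(zero)) for part in partitions]
    return reduce(op, partials, copy.deepcopy(zero))


# Two "partitions" of one-element "vectors".
partitions = [[[1], [1], [1]], [[1], [1]]]
snapshot = copy.deepcopy(partitions)

# fold: only the zero-element buffers are mutated.
total = fold_partitions(partitions, [0], add_in_place)
data_untouched = partitions == snapshot

# reduce with the same in-place op: the first element becomes the
# accumulator, so the underlying data is silently modified.
flat = [x for part in partitions for x in part]
reduce(add_in_place, flat)
first_element_mutated = flat[0] != [1]
```

With fold, the in-place operation only ever writes into a buffer that started as the zero element. With reduce, the first element of the data doubles as the accumulator, which is why the mutable approach is unsafe there.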

acc = op(obj, acc), why is this operation order used instead of acc = op(acc, obj)?

See SPARK-6416 and SPARK-7683.
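To see why the argument order matters at all, here is a small plain-Python sketch comparing the two orders from the question. It is not Pyspark's actual source, and `fold_acc_first` and `fold_obj_first` are hypothetical names used only for this illustration.

```python
from operator import add


def fold_acc_first(items, zero, op):
    """Sequential fold using acc = op(acc, obj)."""
    acc = zero
    for obj in items:
        acc = op(acc, obj)
    return acc


def fold_obj_first(items, zero, op):
    """Sequential fold using acc = op(obj, acc)."""
    acc = zero
    for obj in items:
        acc = op(obj, acc)
    return acc


def concat(a, b):
    """String concatenation: associative but not commutative."""
    return a + b


letters = ["a", "b", "c"]
left = fold_acc_first(letters, "", concat)     # "abc"
flipped = fold_obj_first(letters, "", concat)  # "cba"

# For a commutative operation the two orders agree.
same_sum = fold_acc_first([1, 2, 3], 0, add) == fold_obj_first([1, 2, 3], 0, add)
```

For a commutative operation such as addition the order is irrelevant, which is consistent with the requirement mentioned in the question. For a non-commutative one such as string concatenation, the two orders give different results even on a single partition.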
