Why are aggregate and fold two different APIs in Spark?


Question

When using the Scala standard library, I can do something like this:

scala> val scalaList = List(1,2,3)
scalaList: List[Int] = List(1, 2, 3)

scala> scalaList.foldLeft(0)((acc,n)=>acc+n)
res0: Int = 6

This makes one Int out of many Ints.

I can also do something like this:

scala> scalaList.foldLeft("")((acc,n)=>acc+n.toString)
res1: String = 123

This makes one String out of many Ints.

So foldLeft can be either homogeneous or heterogeneous, whichever we want, all in one API.

In Spark, if I want one Int out of many Ints, I can do this:

scala> val rdd = sc.parallelize(List(1,2,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> rdd.fold(0)((acc,n)=>acc+n)
res1: Int = 6

The fold API is similar to foldLeft, but it is only homogeneous: an RDD[Int] can only produce an Int with fold.

There is an aggregate API in Spark too:

scala> rdd.aggregate("")((acc,n)=>acc+n.toString, (s1,s2)=>s1+s2)
res11: String = 132

It is heterogeneous: an RDD[Int] can now produce a String.

So, why are fold and aggregate implemented as two different APIs in Spark?

Why are they not designed like foldLeft, which can be both homogeneous and heterogeneous?

(I am very new to Spark, so please excuse me if this is a silly question.)

Recommended answer

fold can be implemented more efficiently because it doesn't depend on a fixed order of evaluation. Each cluster node can fold its own chunk in parallel, followed by one small overall fold at the end. With foldLeft, by contrast, each element has to be folded in in order, so nothing can be done in parallel.
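To make the chunk-then-combine idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names `partitions`, `chunkedFold` and `chunkedAggregate` are hypothetical stand-ins invented for illustration; they only mimic the shape of an RDD's fold and aggregate and are not Spark's implementation.

```scala
object FoldVsAggregateSketch {
  // Hypothetical stand-in for an RDD's partitions: a plain list of chunks.
  val partitions: List[List[Int]] = List(List(1), List(2, 3))

  // fold-style: a single zero and a single operation (A, A) => A are used
  // both inside each chunk and when combining the per-chunk results,
  // so the result type has to equal the element type.
  def chunkedFold[A](chunks: List[List[A]])(zero: A)(op: (A, A) => A): A =
    chunks.map(_.fold(zero)(op)).fold(zero)(op)

  // aggregate-style: seqOp folds elements of type A into a B inside a chunk,
  // and combOp merges two per-chunk B results. Because B may differ from A,
  // the second function is required.
  def chunkedAggregate[A, B](chunks: List[List[A]])(zero: B)(
      seqOp: (B, A) => B,
      combOp: (B, B) => B): B =
    chunks.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)

  def main(args: Array[String]): Unit = {
    println(chunkedFold(partitions)(0)(_ + _))
    println(chunkedAggregate(partitions)("")((acc, n) => acc + n.toString, _ + _))
  }
}
```

In this sketch the chunks are combined strictly left to right, so the string version yields 123. On a real cluster the per-partition results may be merged in a different order, which is presumably why the aggregate call in the question printed 132. The results are only stable when the combining operation is associative and, for string concatenation, when the merge order happens to match.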

(Also, it's nice to have a simpler API for the common case, for convenience. The standard library has reduce as well as foldLeft for this reason.)
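As a small illustration of that convenience argument in the standard library, the sketch below puts reduce next to foldLeft on the list from the question. The object and value names are made up for the example.

```scala
object ReduceVsFoldLeft {
  val xs: List[Int] = List(1, 2, 3)

  // reduce: no zero value, and the result type matches the element type.
  val viaReduce: Int = xs.reduce(_ + _)

  // foldLeft: explicit zero, and the result type is free to differ.
  val viaFoldLeft: Int = xs.foldLeft(0)(_ + _)
  val asString: String = xs.foldLeft("")((acc, n) => acc + n.toString)

  def main(args: Array[String]): Unit = {
    println(viaReduce)
    println(viaFoldLeft)
    println(asString)
  }
}
```

reduce covers the homogeneous case with less ceremony, though it throws on an empty collection, while foldLeft handles both the homogeneous and heterogeneous cases at the cost of an explicit zero.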
