Apache Spark aggregate function using min value


Question

I tried an example found at http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

Why is the minimum length 1? The first partition contains ["12", "23"] and the second one ["345", "4567"]. Comparing the minimum from any partition with the initial value "", the minimum should be 0, so in my understanding the expected result would be 00.

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10

For this one my understanding is the same: the final result should be 00.

Thanks in advance.

Recommended answer

First, let's see how parallelize splits your data between partitions:

val x = sc.parallelize(List("12","23","345","4567"), 2)
x.glom.collect
// Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))

val y = sc.parallelize(List("12","23","345",""), 2)
y.glom.collect
// Array[Array[String]] = Array(Array(12, 23), Array(345, ""))
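
The split shown by glom can be mimicked in plain Scala. This is only a sketch that assumes an even, contiguous split, which is consistent with the output above; `sliceEvenly` is a hypothetical helper, not a Spark API, and Spark's own slicing logic is internal and may behave differently for other inputs.

```scala
// Hypothetical helper (not part of Spark): cut a sequence into
// numSlices contiguous chunks of near-equal size.
def sliceEvenly[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = (i.toLong * data.length / numSlices).toInt
    val end = ((i + 1).toLong * data.length / numSlices).toInt
    data.slice(start, end)
  }

val xParts = sliceEvenly(Seq("12", "23", "345", "4567"), 2)
val yParts = sliceEvenly(Seq("12", "23", "345", ""), 2)

println(xParts) // two chunks: ("12", "23") and ("345", "4567")
println(yParts) // two chunks: ("12", "23") and ("345", "")
```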

Next, define two helpers:

def seqOp(x: String, y: String) =  math.min(x.length, y.length).toString
def combOp(x: String, y: String) = x + y

Now let's trace execution for x. Ignoring parallelism, it can be represented as follows:

(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") "4567"))
(combOp (seqOp "0" "23") (seqOp (seqOp "" "345") "4567"))
(combOp "1" (seqOp (seqOp "" "345") "4567"))
(combOp "1" (seqOp "0" "4567"))
(combOp "1" "1")
"11"

The same thing for y:

(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") ""))
(combOp (seqOp "0" "23") (seqOp (seqOp "" "345") ""))
(combOp "1" (seqOp (seqOp "" "345") ""))
(combOp "1" (seqOp "0" ""))
(combOp "1" "0")
"10"

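Both traces can be checked without a cluster. The sketch below models aggregate on ordinary Scala collections: each partition is folded with seqOp starting from the zero value, and the partial results are then merged with combOp. `simulateAggregate` is a hypothetical helper, and it assumes the partial results are merged in partition order, which Spark does not guarantee.

```scala
def seqOp(x: String, y: String): String = math.min(x.length, y.length).toString
def combOp(x: String, y: String): String = x + y

// Hypothetical stand-in for RDD.aggregate over explicitly listed partitions.
def simulateAggregate(partitions: Seq[Seq[String]], zero: String): String = {
  // Fold every partition with seqOp, each starting from the zero value.
  val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
  // Merge the partial results with combOp, again starting from the zero value.
  perPartition.foldLeft(zero)(combOp)
}

val resX = simulateAggregate(Seq(Seq("12", "23"), Seq("345", "4567")), "")
val resY = simulateAggregate(Seq(Seq("12", "23"), Seq("345", "")), "")

println(resX) // 11
println(resY) // 10
```

The key step is the second seqOp call in each partition: its accumulator is already the string "0", whose length is 1, so min(1, 2) gives "1" rather than the 0 the question expected.
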
That being said, you shouldn't use aggregate here in the first place. Since the operations you apply are not associative, the whole idea is simply wrong.
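
To make the problem concrete, here is a sketch in plain Scala (no Spark required) that runs the same two operations over three different partitionings of the same data, next to a version that keeps an Int accumulator. `badAggregate` and `minLength` are hypothetical names introduced only for this illustration.

```scala
def seqOp(x: String, y: String): String = math.min(x.length, y.length).toString
def combOp(x: String, y: String): String = x + y

// Hypothetical model of aggregate("")(seqOp, combOp) over explicit partitions.
def badAggregate(partitions: Seq[Seq[String]]): String =
  partitions.map(_.foldLeft("")(seqOp)).foldLeft("")(combOp)

// Minimum string length with an Int accumulator: min is associative and
// commutative, and Int.MaxValue acts as a neutral zero value for it.
def minLength(partitions: Seq[Seq[String]]): Int =
  partitions
    .map(_.foldLeft(Int.MaxValue)((acc, s) => math.min(acc, s.length)))
    .foldLeft(Int.MaxValue)((a, b) => math.min(a, b))

val data = Seq("12", "23", "345", "4567")
val onePartition = Seq(data)
val twoPartitions = data.grouped(2).toSeq
val fourPartitions = data.grouped(1).toSeq

val badResults = Seq(onePartition, twoPartitions, fourPartitions).map(badAggregate)
val goodResults = Seq(onePartition, twoPartitions, fourPartitions).map(minLength)

println(badResults)  // List(1, 11, 0000): the answer depends on the partitioning
println(goodResults) // List(2, 2, 2): the answer is the same for every partitioning
```

On an actual RDD, the second version should correspond to something like z.aggregate(Int.MaxValue)((acc, s) => math.min(acc, s.length), (a, b) => math.min(a, b)), or more simply z.map(_.length).min().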
