Explain the aggregate functionality in Spark

Question

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.

The example I have is as follows (using pyspark from Spark version 1.2.0):

sc.parallelize([1, 2, 3, 4]).aggregate(
    (0, 0),  # zeroValue: (running sum, running count)
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),  # seqOp: fold one element into the accumulator
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))  # combOp: merge two per-partition accumulators

Output:

(10, 4)

I get the expected result (10,4), which is the sum 1+2+3+4 and the count of 4 elements. If I change the initial value passed to the aggregate function from (0,0) to (1,0), I get the following result:

sc.parallelize([1, 2, 3, 4]).aggregate(
    (1, 0),  # same seqOp and combOp, but a non-neutral zeroValue
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))

Output:

(19, 4)

The value increases by 9. If I change it to (2,0), the value goes to (28,4) and so on.

Can someone explain to me how this value is calculated? I expected the value to go up by 1, not by 9; I expected to see (11,4), but instead I am seeing (19,4).

Answer

I don't have enough reputation points to comment on the previous answer by Maasg. Actually, the zero value should be neutral towards the seqOp, meaning it wouldn't interfere with the seqOp result: like 0 for addition, or 1 for multiplication.
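
To make "neutral" concrete, here is a minimal sketch (plain Python, no Spark needed) checking that the zero value acts as an identity for the combOp, while a non-neutral value like (1, 0) does not:

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

acc = (10, 4)
assert comb_op((0, 0), acc) == acc      # (0, 0) is an identity for comb_op
assert comb_op(acc, (0, 0)) == acc
assert comb_op((1, 0), acc) == (11, 4)  # (1, 0) shifts the sum: not neutral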

You should NEVER try this with non-neutral values, because the zero value might be applied an arbitrary number of times. This behavior is not tied only to the number of partitions.
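
To see why, here is a simplified pure-Python model of aggregate's semantics (a sketch for intuition only, not the actual Spark source): the zero value seeds the seqOp fold in every partition, and seeds the combOp fold again when the partial results are merged, so a non-neutral zero leaks into the result several times. The real PySpark implementation re-seeds additional internal folds, so the observed count can be even higher and depends on the version and partitioning.

def model_aggregate(partitions, zero, seq_op, comb_op):
    # Phase 1: fold each partition, seeded with zero each time.
    partials = []
    for part in partitions:
        acc = zero
        for value in part:
            acc = seq_op(acc, value)
        partials.append(acc)
    # Phase 2: merge the partial results, seeded with zero once more.
    merged = zero
    for partial in partials:
        merged = comb_op(merged, partial)
    return merged

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

print(model_aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op))  # (10, 4)
print(model_aggregate([[1, 2], [3, 4]], (1, 0), seq_op, comb_op))  # (13, 4): zero folded in 3 times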

I tried the same experiment as stated in the question. With 1 partition, the zero value was applied 3 times; with 2 partitions, 6 times; with 3 partitions, 9 times; and this goes on.
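
Here is a sketch to reproduce that experiment (it assumes a live SparkContext sc, and exact counts may differ across Spark versions). Since the true sum is 10, result[0] - 10 reveals how many times the non-neutral zero value (1, 0) was folded in:

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])

for num_partitions in (1, 2, 3):
    result = sc.parallelize([1, 2, 3, 4], num_partitions).aggregate(
        (1, 0), seq_op, comb_op)
    # The true sum is 10, so result[0] - 10 counts the zero applications.
    print("partitions=%d -> result=%s, zero applied %d times"
          % (num_partitions, result, result[0] - 10))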
