Explain the aggregate functionality in Spark (with Python and Scala)


Problem description

I am looking for a better explanation of the aggregate functionality that is available via Spark in Python.

The example I have is as follows (using pyspark from Spark 1.2.0):

sc.parallelize([1,2,3,4]).aggregate(
  (0, 0),
  (lambda acc, value: (acc[0] + value, acc[1] + 1)),
  (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))

Output:

(10, 4)

I get the expected result (10,4), which is the sum 1+2+3+4 together with the count of 4 elements. If I change the initial value passed to the aggregate function from (0,0) to (1,0), I get the following result:

sc.parallelize([1,2,3,4]).aggregate(
    (1, 0),
    (lambda acc, value: (acc[0] + value, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))

Output:

(19, 4)

The value increases by 9. If I change it to (2,0), the value goes to (28,4), and so on.

Can someone explain to me how this value is calculated? I expected the value to go up by 1, not by 9; I expected to see (11,4), but instead I am seeing (19,4).

Recommended answer

I don't have enough reputation points to comment on the previous answer by Maasg. Actually, the zero value should be 'neutral' towards the seqOp, meaning it doesn't interfere with the seqOp result, like 0 for addition or 1 for multiplication.

You should never try this with non-neutral values, because they may be applied an arbitrary number of times. This behavior is not tied only to the number of partitions.
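
As a quick illustration of that neutrality (a minimal sketch, assuming an existing SparkContext named sc, as in the question): with the neutral zero value (0, 0), aggregate returns the same (10, 4) no matter how many partitions the RDD is split into.

for n in (1, 2, 4):
    result = sc.parallelize([1, 2, 3, 4], n).aggregate(
        (0, 0),                                           # neutral zero value for (sum, count)
        lambda acc, value: (acc[0] + value, acc[1] + 1),  # seqOp: fold one element into a partition's accumulator
        lambda a, b: (a[0] + b[0], a[1] + b[1]))          # combOp: merge per-partition accumulators
    print(n, result)  # (10, 4) for every partition count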

I tried the same experiment as stated in the question. With 1 partition, the zero value was applied 3 times; with 2 partitions, 6 times; with 3 partitions, 9 times; and this continues.
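
A sketch of that experiment (again assuming an existing SparkContext sc; the exact counts are an implementation detail and may differ between Spark versions): with the non-neutral zero value (1, 0), the surplus of the first component over the true sum 10 shows how many times the zero value was folded in.

for n in (1, 2, 3):
    total, count = sc.parallelize([1, 2, 3, 4], n).aggregate(
        (1, 0),                                           # deliberately non-neutral zero value
        lambda acc, value: (acc[0] + value, acc[1] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]))
    # total - 10 = how many times the extra 1 from the zero value was added in
    print("partitions:", n, "-> zero value applied", total - 10, "times")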
