How does Spark aggregate function - aggregateByKey work?


Problem description

Say I have a distributed system on 3 nodes and my data is distributed among those nodes. For example, I have a test.csv file which exists on all 3 nodes and contains 2 columns:

row   | id , c
---------------
row1  | k1 , c1  
row2  | k1 , c2  
row3  | k1 , c3  
row4  | k2 , c4  
row5  | k2 , c5  
row6  | k2 , c6  
row7  | k3 , c7  
row8  | k3 , c8  
row9  | k3 , c9  
row10 | k4 , c10   
row11 | k4 , c11  
row12 | k4 , c12 

Then I use SparkContext.textFile to read the file in as an RDD, and so on. As far as I understand, each Spark worker node will read a portion of the file. So right now let's say each node will store:

  • Node 1: rows 1~4
  • Node 2: rows 5~8
  • Node 3: rows 9~12
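
To make the setup concrete, here is a minimal sketch (not part of the original question) of how that read might look, assuming the comma-separated test.csv layout above and an existing SparkContext named sc:

```scala
// Read the file and turn each line "k1 , c1" into a (key, value) pair.
// Spark splits the file into partitions, so different nodes hold different rows.
val pairs = sc.textFile("test.csv")
  .map(_.split(","))
  .map(cols => (cols(0).trim, cols(1).trim))   // e.g. (k1, c1), (k1, c2), ...
```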

My question is: let's say I want to do a computation on that data, and there is one step where I need to group by key, so the key-value pairs would be [k1 [{k1 c1} {k1 c2} {k1 c3}]] .. and so on.

There is a function called groupByKey() which is very expensive to use, and aggregateByKey() is recommended instead. So I'm wondering how groupByKey() and aggregateByKey() work under the hood? Can someone explain using the example I provided above? After shuffling, where do the rows reside on each node?
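
For reference (again, my own sketch rather than part of the original question), the grouping described above would look roughly like this with groupByKey(), assuming the pairs RDD sketched earlier:

```scala
// All values for a key are pulled together into one Iterable, e.g.
// (k1, Iterable(c1, c2, c3)), (k2, Iterable(c4, c5, c6)), ...
val grouped = pairs.groupByKey()
grouped.collect().foreach(println)
```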

Recommended answer

aggregateByKey() is almost identical to reduceByKey() (both call combineByKey() behind the scenes), except that you give a starting value for aggregateByKey(). Most people are familiar with reduceByKey(), so I will use that in the explanation.
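
As an illustration (my own example, not from the answer), counting the values per key shows the difference in shape between the two calls:

```scala
// reduceByKey merges values of the same type as the input values, so we first
// map every value to 1; aggregateByKey instead starts each key at a zero value
// (0 here) and takes two functions: seqOp folds a value into the accumulator
// within a partition, combOp merges accumulators from different partitions.
val countsViaReduce    = pairs.mapValues(_ => 1).reduceByKey(_ + _)
val countsViaAggregate = pairs.aggregateByKey(0)((acc, _) => acc + 1, (a, b) => a + b)
// Both produce (k1,3), (k2,3), (k3,3), (k4,3) for the example data.
```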

The reason reduceByKey() is so much better is that it makes use of a MapReduce feature called a combiner. Any function like + or * can be used in this fashion because the order of the elements it is called on doesn't matter. This allows Spark to start "reducing" values with the same key even if they are not all in the same partition yet.

On the flip side, groupByKey() gives you more versatility since you write a function that takes an Iterable, meaning you could even pull all the elements into an array. However, it is inefficient because for it to work the full set of (K,V) pairs has to be in one partition.
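
If the per-key collection is really what you need, one common pattern (a sketch of my own, not from the answer) is to build it with aggregateByKey(); note that this still shuffles every value, so the real savings come from seqOp functions that actually shrink the data:

```scala
// Zero value is an empty list; values are prepended within each partition and
// the partial lists are concatenated across partitions after the shuffle.
val valuesPerKey = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,     // seqOp: add one value to this partition's list
  (a, b)   => a ::: b       // combOp: merge lists built on different nodes
)
```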

The step that moves the data around in a reduce-type operation is generally called the shuffle. At the very simplest level, the data is partitioned to each node (often with a hash partitioner) and then sorted on each node.
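
To address the "where do the rows reside after shuffling" part concretely, here is a rough sketch (my own, assuming 3 partitions and the default HashPartitioner) of how keys are assigned to partitions:

```scala
import org.apache.spark.HashPartitioner

// Every key is mapped to a partition by hashing, so after the shuffle all rows
// for k1 sit together in one partition, all rows for k2 in another, and so on.
// Which node holds which key depends only on the key's hash, not on where the
// rows started out.
val partitioner = new HashPartitioner(3)
val byPartition = Seq("k1", "k2", "k3", "k4")
  .map(k => k -> partitioner.getPartition(k))   // which of the 3 partitions each key lands in
```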
