AggregateByKey Partitioning?


Question

I have:

A_RDD = anRDD.map()

B_RDD = A_RDD.aggregateByKey()

OK, my question is:

If I put partitionBy(new HashPartitioner) after A_RDD, like this:

A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))

B_RDD = A_RDD.aggregateByKey()

1) Will this be as efficient as leaving it as it was in the first place? aggregateByKey() will use the HashPartitioner from A_RDD, right?

2) Or, if I leave it as in the first example, will aggregateByKey() aggregate every partition by key first, and then send each "aggregated" (key, value) pair to the right partition in a more efficient way?

3) Why can't map, flatMap, and other transformations on RDDs take an argument for how to partition the (key, value) pairs on the fly? What I mean is, for example, that during the map() operation each tuple would also be sent to a specific partition designated by a partitioner argument on map, e.g. map( , Partitioner).

I am trying to grasp how aggregateByKey() works, but every time I think I've got it, a new question arises... Thanks in advance.

Recommended answer

  • If you put partitionBy before aggregateByKey, it will typically be less efficient than aggregateByKey alone. partitionBy shuffles every raw record, and aggregateByKey then by default reuses the HashPartitioner from A_RDD, so you effectively disable map-side combine.
  • If you leave it as it is, there will be a map-side combine: each partition is aggregated by key before the shuffle, which is typically more efficient.
  • Non-shuffling operations such as map and flatMap don't take a partitioner because there is no data movement. Operations are performed locally on each machine.
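
To make the difference between the first two points concrete, here is a minimal sketch in plain Scala (2.13 or later, no Spark required). It is a simulation under stated assumptions, not Spark's implementation: each inner Seq stands in for one partition of A_RDD, and the names MapSideCombineSketch, targetPartition, shuffle, and combineLocally are hypothetical helpers introduced only for illustration. It counts how many records cross the simulated shuffle in each case.

```scala
// Plain-Scala simulation; all names here are hypothetical, for illustration only.
object MapSideCombineSketch {
  type Pair = (String, Int)

  // Same idea as a hash partitioner: non-negative hash modulo partition count.
  def targetPartition(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Moves every record to its target partition. Returns the new partitions
  // and the number of records that crossed the simulated shuffle.
  def shuffle(partitions: Seq[Seq[Pair]], numPartitions: Int): (Seq[Seq[Pair]], Int) = {
    val all = partitions.flatten
    val grouped = all.groupBy { case (k, _) => targetPartition(k, numPartitions) }
    val out = (0 until numPartitions).map(i => grouped.getOrElse(i, Seq.empty))
    (out, all.size)
  }

  // Sums values per key inside each partition, without moving any data.
  def combineLocally(partitions: Seq[Seq[Pair]]): Seq[Seq[Pair]] =
    partitions.map(_.groupMapReduce(_._1)(_._2)(_ + _).toSeq)

  // Two input partitions holding six raw (key, value) records in total.
  val input: Seq[Seq[Pair]] = Seq(
    Seq("a" -> 1, "a" -> 1, "b" -> 1),
    Seq("a" -> 1, "b" -> 1, "b" -> 1)
  )

  // Case 1: partition first, then aggregate. Every raw record is shuffled.
  val (repartitioned, shuffledRaw) = shuffle(input, 2)
  val resultPartitionFirst: Map[String, Int] = combineLocally(repartitioned).flatten.toMap

  // Case 2: aggregate alone. Combine per partition, shuffle the partial sums, merge.
  val (merged, shuffledCombined) = shuffle(combineLocally(input), 2)
  val resultAggregateOnly: Map[String, Int] = combineLocally(merged).flatten.toMap

  def main(args: Array[String]): Unit = {
    println(s"partition first: $shuffledRaw records shuffled, result = $resultPartitionFirst")
    println(s"aggregate alone: $shuffledCombined records shuffled, result = $resultAggregateOnly")
  }
}
```

With this input, both cases produce the same per-key sums, but partitioning first moves all 6 raw records while aggregating alone moves only 4 partial sums. The gap widens as the number of duplicate keys per partition grows.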
