Spark - Is it possible to control placement of partitions to nodes?


Question

In Spark, custom Partitioners can be supplied for RDDs. Normally, the produced partitions are randomly distributed to the set of workers. For example, if we have 20 partitions and 4 workers, each worker will get (approximately) 5 partitions. However, the placement of partitions on workers (nodes) appears to be random, as in the table below.

          trial 1    trial 2
worker 1: [10-14]    [15-19]
worker 2: [5-9]      [5-9]  
worker 3: [0-4]      [10-14]
worker 4: [15-19]    [0-4]  

This is fine for operations on a single RDD, but when you use join() or cogroup() operations that span multiple RDDs, the communication between those nodes becomes a bottleneck. I would use the same partitioner for multiple RDDs and want to be sure they will end up on the same node so the subsequent join() would not be costly. Is it possible to control the placement of partitions to workers (nodes)?

          desired
worker 1: [0-4]
worker 2: [5-9]
worker 3: [10-14]
worker 4: [15-19]
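
For concreteness, here is a minimal sketch (Scala; the class name and the contiguous-range scheme are illustrative, not part of the original question) of a custom Partitioner that produces the key grouping shown in the desired layout above:

import org.apache.spark.Partitioner

// Illustrative partitioner for non-negative integer keys: with numBuckets = 4 and
// keysPerBucket = 5, keys 0-4 go to partition 0, keys 5-9 to partition 1, and so on.
class ContiguousRangePartitioner(val numBuckets: Int, val keysPerBucket: Int) extends Partitioner {
  override def numPartitions: Int = numBuckets
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] / keysPerBucket

  // equals/hashCode let Spark recognize two RDDs partitioned this way as co-partitioned.
  override def equals(other: Any): Boolean = other match {
    case p: ContiguousRangePartitioner =>
      p.numBuckets == numBuckets && p.keysPerBucket == keysPerBucket
    case _ => false
  }
  override def hashCode: Int = 31 * numBuckets + keysPerBucket
}

An RDD keyed by the integers 0-19 could then be repartitioned with rdd.partitionBy(new ContiguousRangePartitioner(4, 5)); which of the four workers ends up hosting each of those partitions is exactly the open question here.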

Answer

I would use the same partitioner for multiple RDDs and want to be sure they will end up on the same node so the subsequent join() would not be costly.

This is the right way to handle joins between RDDs, so that the records to be joined are guaranteed to be in the same partition/executor.
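
A minimal sketch of this approach (Scala; the HashPartitioner choice, the local master, and the variable names are illustrative assumptions, not taken from the original answer):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local master only for experimentation; on a cluster the master comes from spark-submit.
val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join").setMaster("local[4]"))

// Two pair RDDs with illustrative data, keyed by the same integer keys.
val left  = sc.parallelize((0 until 20).map(k => (k, s"L$k")))
val right = sc.parallelize((0 until 20).map(k => (k, s"R$k")))

// Partition both sides with the same partitioner and cache the results so the
// partitioning work is done once and reused.
val part   = new HashPartitioner(4)
val leftP  = left.partitionBy(part).cache()
val rightP = right.partitionBy(part).cache()

// Because both inputs share the same partitioner, matching keys already live in the
// same partition number on both sides, so the join is a narrow dependency and does
// not shuffle either side again.
val joined = leftP.join(rightP)

Caching the partitioned RDDs matters: without it, the partitionBy shuffle may be recomputed each time the RDDs are reused.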

Is it possible to control the placement of partitions to workers (nodes)?

It is not possible to explicitly specify the worker node for each partition. This would break the abstractions of parallel computation on which Spark, like other parallel computation frameworks such as MapReduce or Tez, is built.

Spark and other parallel computation frameworks are designed to be fault tolerant: if a small subset of worker nodes fails, those nodes are replaced with other worker nodes, and this process is transparent to the user application.

These abstractions would break if a user were allowed to refer to a specific worker node in the application. The only means of governing the placement of records into the partitions of an RDD is to supply your own Partitioner for that RDD.
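
As a small self-contained check (again with illustrative names and a local master) that two RDDs are co-partitioned, which is the guarantee this lever actually gives you:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitioner-check").setMaster("local[4]"))
val part = new HashPartitioner(4)

val a = sc.parallelize(0 until 20).map(k => (k, k)).partitionBy(part).cache()
val b = sc.parallelize(0 until 20).map(k => (k, k * 2)).partitionBy(part).cache()

// Both RDDs report the same partitioner, so Spark plans the join as a narrow
// dependency: the only shuffles in the printed lineage are the two partitionBy
// steps, and the join itself adds none.
println(a.partitioner == b.partitioner)   // true
println(a.join(b).toDebugString)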
