Unbalanced Flink Streaming Load

Question

https://imgur.com/jdisF4T

I have a 4-node standalone Flink cluster. There is one TaskManager on every node (TM A, TM B, TM C, TM D), and every TaskManager has 2 slots (A1, A2, B1, ..., D2).

The source of the job runs with parallelism 8. There are 6 map/flatMap operators fed by the source (all of them with parallelism 2).

While checking the flow, I realised that all of the flatMap operations are using slots from the same TM (that's OK), but the overall job is using only 2 of the TMs, so the load is very unbalanced.

Why does this happen? How can I balance the load?

Recommended answer

There are several relevant factors:

  1. By default, whenever one operator forwards directly to the next, those operators are chained together to avoid serialization and networking overhead.
  2. By default, the number of slots a job needs equals its maximum parallelism, and each slot is assigned to execute one complete slice of the application (one instance of each operator). If you want more control over the assignment of tasks to slots, you can set up slot sharing groups to isolate particular operators or groups of operators into their own slot(s).
  3. The Flink scheduler assigns tasks to task slots without giving any thought to locality: it only thinks in terms of slots, not task managers. There has been some discussion about doing a better job of spreading the load across the available machines in cases like yours (see https://issues.apache.org/jira/browse/FLINK-11815) and about providing more explicit control (see https://issues.apache.org/jira/browse/FLINK-11166).
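
To make the second point concrete, here is a minimal plain-Java sketch (not Flink code) of the slot arithmetic as I understand it: operators in the same slot sharing group can share slots, so a group needs as many slots as its highest parallelism, and separate groups add up. The `SlotDemand` class, the `Operator` type, and the group names `"default"` and `"maps"` are all hypothetical, chosen only for illustration. In a real job the group would be set with `slotSharingGroup(...)` on the operator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SlotDemand {

    /** Hypothetical model of one operator: its parallelism and slot sharing group. */
    static final class Operator {
        final String name;
        final int parallelism;
        final String slotSharingGroup;

        Operator(String name, int parallelism, String slotSharingGroup) {
            this.name = name;
            this.parallelism = parallelism;
            this.slotSharingGroup = slotSharingGroup;
        }
    }

    /**
     * Slots needed by a job under the assumption stated above: each group
     * needs max(parallelism) slots, and groups cannot share slots with
     * each other, so the per-group maxima are summed.
     */
    static int requiredSlots(List<Operator> operators) {
        Map<String, Integer> maxPerGroup = new HashMap<>();
        for (Operator op : operators) {
            maxPerGroup.merge(op.slotSharingGroup, op.parallelism, Math::max);
        }
        int total = 0;
        for (int slots : maxPerGroup.values()) {
            total += slots;
        }
        return total;
    }

    /** The job from the question: one source (parallelism 8), six maps (parallelism 2). */
    static List<Operator> questionJob(String mapGroup) {
        List<Operator> ops = new ArrayList<>();
        ops.add(new Operator("source", 8, "default"));
        for (int i = 1; i <= 6; i++) {
            ops.add(new Operator("map-" + i, 2, mapGroup));
        }
        return ops;
    }

    public static void main(String[] args) {
        int availableSlots = 4 * 2; // 4 TaskManagers with 2 slots each

        int shared = requiredSlots(questionJob("default")); // maps share slots with the source
        int isolated = requiredSlots(questionJob("maps"));  // maps get their own group

        System.out.println("available slots:       " + availableSlots);
        System.out.println("default slot sharing:  " + shared);
        System.out.println("maps in their own group: " + isolated);
    }
}
```

Under these assumptions the default layout needs 8 slots, which is exactly what the cluster in the question offers, while moving the six maps into their own group would need 10. So isolating operators with slot sharing groups on this cluster would also require adding slots or lowering the source parallelism.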
