KeyBy data distribution in Apache Flink: logical or physical operator?

Question

According to the Apache Flink documentation, the KeyBy transformation logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.

Is KeyBy a 100% logical transformation? Doesn't it include physical data partitioning for distribution across the cluster nodes? If it is purely logical, how can it guarantee that all the records with the same key are assigned to the same partition?

For instance, assume that we are getting a distributed data stream from an Apache Kafka cluster of n nodes, and that the Apache Flink cluster running our streaming job consists of m nodes. When the keyBy transformation is applied to the incoming data stream, how does it guarantee logical data partitioning? Or does it involve physical data partitioning across the cluster nodes?

It seems I am confusing logical and physical data partitioning.

Recommended answer

The keyspace of all possible keys is divided into some number of key groups. The number of key groups (which is the same as the maximum parallelism) is a configuration parameter you can set; the default value is 128, although Flink may derive a larger default for jobs that run with high parallelism.
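To make the key-to-key-group mapping concrete, here is a minimal sketch in plain Java. It is not Flink's code: the class name, the method names, and the `mix` function are hypothetical. Flink itself applies a murmur hash to `key.hashCode()` before taking the remainder modulo the maximum parallelism; the mixer below is only a stand-in for that step.

```java
// Simplified sketch of mapping keys to key groups; not Flink's actual code.
final class KeyGroupSketch {
    // Assumed value: the default of 128 key groups mentioned in the answer.
    static final int MAX_PARALLELISM = 128;

    // Stand-in bit mixer that spreads out poor hashCode() values.
    // Flink uses a murmur hash here; this function is an illustrative substitute.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Every key lands in exactly one key group in [0, maxParallelism).
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(mix(key.hashCode()), maxParallelism);
    }

    public static void main(String[] args) {
        // The repeated key "user-1" is printed with the same key group both times.
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            System.out.println(key + " -> key group " + keyGroupFor(key, MAX_PARALLELISM));
        }
    }
}
```

Because the mapping depends only on the key and on the maximum parallelism, it stays the same when the job is rescaled, which is why the number of key groups also caps the parallelism.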

Each key belongs to exactly one key group. When a cluster is launched, each task manager is assigned some specific key groups. If the cluster is started from a checkpoint or savepoint, those snapshots are indexed by key group, and each task manager loads the state for the keys in the key groups it has been assigned.
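The assignment of key groups can be sketched as follows. This is an illustration under assumptions, not Flink's implementation: the class and method names are hypothetical, and the sketch speaks of numbered parallel instances where the answer speaks of task managers. The ownership formula (key group times parallelism, divided by maximum parallelism) mirrors the way Flink hands each instance one contiguous range of key groups.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of key-group ownership; not Flink's actual code.
final class KeyGroupOwnershipSketch {
    // 0-based index of the parallel instance that owns a key group.
    // Integer division yields contiguous ranges of key groups per instance.
    static int ownerOf(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    // All key groups owned by one instance, found by scanning every key group.
    static List<Integer> keyGroupsOwnedBy(int instance, int maxParallelism, int parallelism) {
        List<Integer> owned = new ArrayList<>();
        for (int keyGroup = 0; keyGroup < maxParallelism; keyGroup++) {
            if (ownerOf(keyGroup, maxParallelism, parallelism) == instance) {
                owned.add(keyGroup);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed default from the answer
        int parallelism = 4;      // hypothetical job parallelism
        for (int instance = 0; instance < parallelism; instance++) {
            List<Integer> owned = keyGroupsOwnedBy(instance, maxParallelism, parallelism);
            System.out.println("instance " + instance + " owns key groups "
                    + owned.get(0) + ".." + owned.get(owned.size() - 1));
        }
    }
}
```

When restoring from a snapshot indexed by key group, an instance only needs to read the entries for its own range, so changing the parallelism changes which ranges are read rather than how individual keys are hashed.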

While a job is running, each task manager knows the key selector functions used to compute the keys, and how keys map onto key groups. The task managers also know how key groups are assigned to task managers. This makes it straightforward to route each message to the task manager responsible for that message's key.
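The routing step described above can be simulated in a few lines. This is a sketch under assumptions rather than Flink's network code: the names are hypothetical, the `mix` function stands in for the murmur hash Flink uses, and in-memory lists play the role of the receiving instances. It shows that a sender needs only the key, the maximum parallelism, and the parallelism to pick the destination, so every record with the same key is physically sent to the same place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified simulation of keyed routing; not Flink's actual code.
final class KeyedRoutingSketch {
    // Stand-in bit mixer (Flink uses a murmur hash at this step).
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // key -> key group -> owning parallel instance (0-based).
    static int targetInstance(Object key, int maxParallelism, int parallelism) {
        int keyGroup = Math.floorMod(mix(key.hashCode()), maxParallelism);
        return keyGroup * parallelism / maxParallelism;
    }

    // Delivers each record (represented by its key) to the inbox of its target instance.
    static Map<Integer, List<String>> route(List<String> keys, int maxParallelism, int parallelism) {
        Map<Integer, List<String>> inboxes = new TreeMap<>();
        for (String key : keys) {
            int target = targetInstance(key, maxParallelism, parallelism);
            inboxes.computeIfAbsent(target, i -> new ArrayList<>()).add(key);
        }
        return inboxes;
    }

    public static void main(String[] args) {
        // Hypothetical input: keys extracted from five records.
        List<String> keys = List.of("a", "b", "a", "c", "a");
        route(keys, 128, 4).forEach((instance, inbox) ->
                System.out.println("instance " + instance + " receives " + inbox));
    }
}
```

Every sender computes the same function, so no coordination between senders is needed to keep records with equal keys together.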
