Apache Flink streaming in cluster does not split jobs with workers


Question

My objective is to set up a high-throughput cluster using Kafka as the source and Flink as the stream processing engine. Here's what I have done.

I have set up a 2-node cluster with the following configuration on the master and the worker.

Master flink-conf.yaml

jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 50
parallelism.default: 100

Worker flink-conf.yaml

jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 512 #256
taskmanager.heap.mb: 1024 #512
taskmanager.numberOfTaskSlots: 50
parallelism.default: 100
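(For context: with a TaskManager on each of the two nodes, taskmanager.numberOfTaskSlots: 50 yields 2 × 50 = 100 task slots in total, which is what parallelism.default: 100 is sized against.)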

The slaves file on the master node looks like this:

<WORKER_IP_ADDR>
localhost

The Flink setup on both nodes is in a folder with the same name. I start up the cluster on the master by running

bin/start-cluster-streaming.sh

This starts up the task manager on the worker node.

My input source is Kafka. Here is the snippet.

// Obtain the streaming execution environment
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Consume the Kafka topic and hand every record to the sink
DataStreamSource<String> stream =
    env.addSource(
        new KafkaSource<String>(kafkaUrl, kafkaTopic, new SimpleStringSchema()));
stream.addSink(stringSinkFunction);

env.execute("Kafka stream");

Here is my sink function.

// Simple sink that prints a line for every record it receives
public class MySink implements SinkFunction<String> {

    private static final long serialVersionUID = 1L;

    @Override
    public void invoke(String arg0) throws Exception {
        processMessage(arg0);
        System.out.println("Processed Message");
    }
}
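One way to verify where records are actually processed is a sink that reports its parallel subtask index. This is a hypothetical debugging variant, not from the original post, assuming the RichSinkFunction API available in Flink 0.9 (class name and output format are mine):

import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical debugging sink: prints which parallel subtask handled each
// record, making it visible whether work is spread across task managers.
public class SubtaskAwareSink extends RichSinkFunction<String> {

    private static final long serialVersionUID = 1L;

    @Override
    public void invoke(String value) throws Exception {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        System.out.println("Subtask " + subtask + " processed: " + value);
    }
}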

Here are the Flink dependencies in my pom.xml.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-core</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>0.9.0</version>
</dependency>

Then I run the packaged jar with this command on the master:

bin/flink run flink-test-jar-with-dependencies.jar
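The job parallelism can also be set explicitly at submission time rather than relying on parallelism.default; a sketch, assuming the Flink CLI's -p (parallelism) option:

bin/flink run -p 100 flink-test-jar-with-dependencies.jar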

However, when I insert messages into the Kafka topic, all of them show up (via the debug messages in the invoke method of my SinkFunction implementation) on the master node alone.

In the Job Manager UI I can see 2 task managers, and the dashboard shows both (screenshots omitted).

Questions:

  1. Why are the worker nodes not getting any tasks?
  2. Am I missing some configuration?

Answer

When reading from a Kafka source in Flink, the maximum degree of parallelism of the source task is limited by the number of partitions of the given Kafka topic. A Kafka partition is the smallest unit that can be consumed by a source task in Flink. If there are more partitions than source tasks, then some tasks will consume multiple partitions.

Consequently, in order to supply input to all 100 of your tasks, you should ensure that your Kafka topic has at least 100 partitions.
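For illustration, such a topic could be created with the Kafka admin tool of that era (a sketch; the topic name and the localhost:2181 ZooKeeper address are placeholders):

bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 100 --topic kafka-test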

If you cannot change the number of partitions of your topic, then it is also possible to initially read from Kafka with a lower degree of parallelism via the setParallelism method. Alternatively, you can use the rebalance method, which will redistribute your data across all available tasks of the downstream operation.
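Applied to the snippet from the question, that could look roughly like this (a sketch; the value 10 stands in for the topic's actual partition count):

// Read with a source parallelism no higher than the partition count,
// then rebalance so that all downstream tasks receive data.
DataStream<String> stream = env
    .addSource(new KafkaSource<String>(kafkaUrl, kafkaTopic, new SimpleStringSchema()))
    .setParallelism(10)  // assumed number of partitions of the topic
    .rebalance();        // round-robin redistribution across downstream tasks

stream.addSink(stringSinkFunction);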
