Apache Flink streaming in cluster does not split jobs with workers


Problem Description

My objective is to setup a high throughput cluster using Kafka as source & Flink as the stream processing engine. Here's what I have done.

I have set up a 2-node cluster with the following configuration on the master and the slave.

Master flink-conf.yaml

jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost

jobmanager.rpc.port: 6123

jobmanager.heap.mb: 256

taskmanager.heap.mb: 512

taskmanager.numberOfTaskSlots: 50

parallelism.default: 100

Slave flink-conf.yaml

jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost

jobmanager.rpc.port: 6123

jobmanager.heap.mb: 512 #256

taskmanager.heap.mb: 1024 #512

taskmanager.numberOfTaskSlots: 50

parallelism.default: 100

The slaves file on the Master node looks like this:

<SLAVE_IP_ADDR>
localhost

The flink setup on both nodes is in a folder which has the same name. I start up the cluster on the master by running

bin/start-cluster-streaming.sh

This starts up the task manager on the slave node.

My input source is Kafka. Here is the snippet.

final StreamExecutionEnvironment env = 
    StreamExecutionEnvironment.getExecutionEnvironment();

DataStreamSource<String> stream = 
    env.addSource(
    new KafkaSource<String>(kafkaUrl,kafkaTopic, new SimpleStringSchema()));
stream.addSink(stringSinkFunction);

env.execute("Kafka stream");

Here is my sink function:

public class MySink implements SinkFunction<String> {

    private static final long serialVersionUID = 1L;

    @Override
    public void invoke(String arg0) throws Exception {
        processMessage(arg0);
        System.out.println("Processed Message");
    }
}

Here are the Flink Dependencies in my pom.xml.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-core</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>0.9.0</version>
</dependency>

Then I run the packaged jar with this command on the master

bin/flink run flink-test-jar-with-dependencies.jar

However, when I insert messages into the Kafka topic, I am able to account for all messages coming in from the topic (via debug messages in the invoke method of my SinkFunction implementation) on the Master node alone.

In the Job manager UI I am able to see 2 Task managers as below:

Also, the dashboard looks like this:

Questions:

  1. Why is the slave node not getting any tasks to execute?
  2. Am I missing some configuration?

Recommended Answer

When reading from a Kafka source in Flink, the maximum degree of parallelism for the source task is limited by the number of partitions of a given Kafka topic. A Kafka partition is the smallest unit which can be consumed by a source task in Flink. If there are more partitions than source tasks, then some tasks will consume multiple partitions.

Consequently, in order to supply input to all of your 100 tasks, you should assure that your Kafka topic has at least 100 partitions.
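As an illustration, assuming a topic named `my-topic` (a hypothetical name) and Kafka's standard command-line tools from that era, the partition count could be checked and raised along these lines:

```shell
# Inspect the topic's current partition count (topic name is an assumption)
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-topic

# Raise the partition count to 100 so that all 100 parallel source tasks
# receive input. Note: Kafka partitions can only be increased, never decreased.
bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic my-topic --partitions 100
```

Increasing partitions changes how keys map to partitions, so do this before relying on any key-based partitioning.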

If you cannot change the number of partitions of your topic, then it is also possible to initially read from Kafka using a lower degree of parallelism using the setParallelism method. Alternatively, you can use the rebalance method which will shuffle your data across all available tasks of the preceding operation.
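A minimal sketch of that alternative, reusing the variables from the question's snippet and assuming, for illustration, a topic with 4 partitions:

```java
// Read from Kafka with a source parallelism matching the topic's
// 4 partitions, then rebalance() to redistribute the records
// round-robin across all downstream sink tasks.
DataStreamSource<String> stream = env.addSource(
    new KafkaSource<String>(kafkaUrl, kafkaTopic, new SimpleStringSchema()));

stream.setParallelism(4)  // one source task per Kafka partition
      .rebalance()        // shuffle across all available downstream tasks
      .addSink(stringSinkFunction);  // sink runs at the default parallelism (100)
```

With this setup only 4 source tasks consume from Kafka, but the rebalance ensures every sink task, on the master and the slave alike, receives a share of the data.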
