Storm-Kafka 多个 spout,如何分担负载? [英] Storm-Kafka multiple spouts, how to share the load?

查看:22
本文介绍了Storm-Kafka 多个 spout,如何分担负载?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在多个 spout 之间共享任务.我有一种情况,我一次从外部来源获取一个元组/消息,并且我想要一个 spout 的多个实例,背后的主要目的是共享负载并提高性能效率.

I am trying to share the task among the multiple spouts. I have a situation, where I'm getting one tuple/message at a time from external source and I want to have multiple instances of a spout, main intention behind is to share the load and increase performance efficiency.

我可以用一个 Spout 本身做同样的事情,但我想在多个 Spout 之间分担负载.我无法获得分散负载的逻辑.因为直到特定的 spout 完成消耗部分(即基于缓冲区大小设置)才会知道消息的偏移量.

I can do the same with one Spout itself, but I want to share the load across multiple spouts. I am not able to get the logic to spread the load. Since the offset of messages will not be known until the particular spout finishes the consuming the part (i.e based on buffer size set).

任何人都可以对如何解决逻辑/算法提出一些建议吗?

Can anyone please put some bright light on the how to work-out on the logic/algorithm?

提前感谢您的时间.


根据答案更新:
现在在 Kafka 上使用了多分区(即 5)
以下是使用的代码:
builder.setSpout("spout", new KafkaSpout(cfg), 5);

通过在每个分区上填充 800 MB 数据进行测试,完成读取需要 ~22 秒.

Tested by flooding with 800 MB data on each partition and it took ~22 sec to finish read.

再次使用parallelism_hint = 1的代码
builder.setSpout("spout", new KafkaSpout(cfg), 1);

Again, used the code with parallelism_hint = 1
i.e. builder.setSpout("spout", new KafkaSpout(cfg), 1);

现在需要更多~23 sec!为什么?

Now it took more ~23 sec! Why?

根据 Storm Docs setSpout() 声明如下:

According to Storm Docs setSpout() declaration is as follows:

public SpoutDeclarer setSpout(java.lang.String id,
                              IRichSpout spout,
                              java.lang.Number parallelism_hint)

哪里,
parallelism_hint - 是应该分配给执行此 spout 的任务数.每个任务都将在集群周围某个进程中的一个线程上运行.

where,
parallelism_hint - is the number of tasks that should be assigned to execute this spout. Each task will run on a thread in a process somewhere around the cluster.

推荐答案

我在 storm-user 讨论类似的东西.

I had come across a discussion in storm-user which discuss something similar.

阅读Spout 并行度与 kafka 分区数量之间的关系.

使用 kafka-spout 进行 Storm 时需要注意的 2 件事

  1. 您可以在 KafkaSpout 上拥有的最大并行度是分区数.
  2. 我们可以将负载拆分为多个 kafka 主题,并且每个主题都有单独的 spout 实例.IE.每个 spout 处理一个单独的主题.
  1. The maximum parallelism you can have on a KafkaSpout is the number of partitions.
  2. We can split the load into multiple kafka topics and have separate spout instances for each. ie. each spout handling a separate topic.

所以如果我们有一个情况,每个主机的 kafka 分区配置为 1,主机数量为 2.即使我们将 spout 并行度设置为 10,预计的最大值也只会是 2,也就是数量分区.

So if we have a case where kafka partitions per host is configured as 1 and the number of hosts is 2. Even if we set the spout parallelism as 10, the max value which is repected will only be 2 which is the number of partitions.

如何提及Kafka-spout中的分区数?

List<HostPort> hosts = new ArrayList<HostPort>();
hosts.add(new HostPort("localhost",9092));
SpoutConfig objConfig=new SpoutConfig(new KafkaConfig.StaticHosts(hosts, 4), "spoutCaliber", "/kafkastorm", "discovery");

如您所见,此处可以使用 hosts.add 添加代理,并且在 new KafkaConfig.StaticHosts(hosts) 中将分区编号指定为 4, 4) 代码片段.

As you can see, here brokers can be added using hosts.add and the partion number is specified as 4 in the new KafkaConfig.StaticHosts(hosts, 4) code snippet.

如何在 Kafka-spout 中提及并行性提示?

builder.setSpout("spout", spout,4);

您可以在使用 setSpout 方法将 spout 添加到拓扑中时提及相同的内容.这里4 是并行提示.

You can mention the same while adding your spout into the topology using setSpout method. Here 4 is the parallelism hint.

更多可能有帮助的链接

理解并行-a-Storm-拓扑

what-is-the-task-in-twitter-storm-并行性

免责声明:!!我是 Storm 和 Java 的新手!!!!所以请编辑/添加如果它需要一些地方.

Disclaimer: !! i am new to both storm and java !!!! So pls edit/add if its required some where.

这篇关于Storm-Kafka 多个 spout,如何分担负载?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆