Is there a way to apply multiple groupings in Storm?


Question

I want to apply "Fields grouping" as well as "Local or shuffle grouping" to my topology, such that each spout sends data to local bolts only, but also uses a field in my documents to decide which of those local bolts a tuple should go to.

So if there were two worker processes, each with 1 Kafka-Spout and 2 Elasticsearch-bolts, local-or-shuffle grouping gives me the following:

Each KS ---> Two local ES-Bolts

fields-grouping gives me the following:

Each KS ---> Possibly all 4 ES-bolts, depending on the value of the field

But I want the following:

Each KS ---> Two local ES-bolts only, but distribution among these
             local bolts should depend on the value of the field

where:

KS = Kafka-Spout

ES = Elasticsearch-bolt
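For reference, here is roughly how those two stock groupings are wired with TopologyBuilder (a minimal sketch; KafkaDocSpout, EsBolt, the component names, and the parallelism hints are hypothetical stand-ins, not from the original post):

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WiringSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaDocSpout(), 2); // hypothetical spout class

        // "Local or shuffle grouping": prefer ES-bolt tasks in the same worker process.
        builder.setBolt("es-bolt", new EsBolt(), 4)              // hypothetical bolt class
               .localOrShuffleGrouping("kafka-spout");

        // "Fields grouping" alternative: route by destination_shard, possibly across workers.
        // builder.setBolt("es-bolt", new EsBolt(), 4)
        //        .fieldsGrouping("kafka-spout", new Fields("destination_shard"));
    }
}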

I want to do this so that I can group all the documents for a single shard together in an ES-bolt. That way the batch sent by an ES-bolt will not be split further by the ES server, because every document in it targets the same shard (I plan to add a destination_shard field to the documents for the field-level grouping, calculated as Murmur3.hash(ID) % numShards).
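As a worked example, here is a minimal sketch of that calculation, assuming Guava's Murmur3 implementation (an assumption: the Murmur3 variant and seed must match whatever your ES version uses for routing, or the computed shard will not be where ES actually writes the document):

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public final class DestinationShard {

    /** Murmur3.hash(ID) % numShards, kept non-negative via floorMod. */
    public static int of(String id, int numShards) {
        // Guava Murmur3 stands in for the ES routing hash (assumption, see above).
        int hash = Hashing.murmur3_128()
                          .hashString(id, StandardCharsets.UTF_8)
                          .asInt();
        return Math.floorMod(hash, numShards);
    }
}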

And then I do not want any inter-process communication, hence the need for "local or shuffle grouping".

Thanks for your help!

Answer

No and yes.

There is no grouping value that does what you want, but you can implement that grouping yourself using:

1) Directed streams, in which you specify the task ID of the bolt instance that should process each tuple (rather than letting Storm figure it out)

2) The topology context passed to each bolt and spout on startup. That object can tell you which tasks are running on the current worker (using getThisWorkerTasks()) and which tasks belong to which component (getComponentTasks())

3) Your own partitioning logic, as you described above, which uses the information from (2) to pick a specific target task for each of your component's outbound tuples (see the sketch after this list)
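Putting (1)-(3) together, here is a minimal sketch of such a routing bolt, assuming Storm 2.x and hypothetical names ("es-bolt", the "id"/"doc" fields, numShards = 5); it illustrates the approach, it is not the original poster's code:

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Routes each document to an ES-bolt task in the SAME worker, picking among
// those local tasks by the document's destination shard.
public class LocalShardRouterBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<Integer> localEsTasks; // ES-bolt tasks running in this worker
    private int numShards;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.numShards = 5; // assumption: must match the ES index's shard count

        // (2) Intersect the ES-bolt's tasks with this worker's tasks.
        List<Integer> esTasks = context.getComponentTasks("es-bolt");
        localEsTasks = new ArrayList<>(esTasks);
        localEsTasks.retainAll(context.getThisWorkerTasks());
        if (localEsTasks.isEmpty()) {
            // Fallback: no ES-bolt task shares this worker, so accept remote delivery.
            localEsTasks = new ArrayList<>(esTasks);
        }
    }

    @Override
    public void execute(Tuple input) {
        String id = input.getStringByField("id");
        // (3) All documents of a given shard go to the same local task.
        // id.hashCode() is a stand-in for the Murmur3 calculation sketched earlier.
        int shard = Math.floorMod(id.hashCode(), numShards);
        int targetTask = localEsTasks.get(shard % localEsTasks.size());
        // (1) Directed emit: we choose the consuming task instead of letting Storm pick.
        collector.emitDirect(targetTask, input, new Values(id, input.getValueByField("doc")));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // emitDirect requires a direct stream; consumers must subscribe with directGrouping.
        declarer.declare(true, new Fields("id", "doc"));
    }
}

The consuming side then has to subscribe with directGrouping, e.g. builder.setBolt("es-bolt", new EsBolt(), 4).directGrouping("router"). A spout can apply the same routing itself via SpoutOutputCollector.emitDirect if you want to skip the intermediate bolt.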
