Is there a way to apply multiple groupings in Storm?

Problem description

I want to apply "Fields grouping" as well as "Local or shuffle grouping" to my topology, such that each spout sends data to local bolts only, but also uses a field in my document to decide which local bolts it should go to.

So if there were two worker processes each with 1 Kafka-Spout and 2 elastic-search-bolts, local-or-shuffle grouping gives me the following:

Each KS ---> Two local ES-bolts

fields-grouping gives me the following:

Each KS ---> Possibly all 4 ES-bolts, depending on the value of the field

But I want the following:

Each KS ---> Two local ES-bolts only, but distribution among these
             local bolts should depend on the value of the field

Where:

KS = Kafka-Spout

ES = Elastic-search

I want to do this so that I can group all the documents for a single shard together in the ES-bolt. This way the batch sent by an ES-bolt will not be split further by the ES-server, as all those documents' destination shard will be the same (I plan to add a field destination_shard to the documents for field-level grouping, and destination_shard will be calculated as Murmur3.hash(ID) % numShards).
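For illustration, that destination_shard computation could look like the following minimal sketch (my own, assuming Guava's Murmur3 implementation; note that Elasticsearch's internal routing uses its own Murmur3 variant, so the computed shard only matches the real placement if both sides hash the same way):

import java.nio.charset.StandardCharsets;

import com.google.common.hash.Hashing;

// Minimal sketch of destination_shard = Murmur3.hash(ID) % numShards.
public final class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    public int destinationShard(String docId) {
        int hash = Hashing.murmur3_32_fixed()
                          .hashString(docId, StandardCharsets.UTF_8)
                          .asInt();
        // floorMod keeps the result in [0, numShards) even when the hash is negative.
        return Math.floorMod(hash, numShards);
    }
}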

And then I do not want any inter-process communication, hence the need for "local or shuffle grouping".

Thanks for your help!

Recommended answer

No and yes.

There is no grouping value that does what you want, but you can implement that grouping yourself using:

1) Directed streams, in which you specify the task id of the bolt instance that should process each tuple (rather than letting Storm figure it out).

2) The topology context passed to each bolt and spout on startup. That object can tell you which tasks are running on the current worker (using getThisWorkerTasks()) and which bolts have which tasks (getComponentTasks()).

3) Your own partitioning logic, as you've described above, which makes use of the info from (2) to pick the specific target task for each of your bolt's outbound tuples (see the sketch after this list).
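Putting those three pieces together, here is a minimal sketch of the spout side using the Storm 2.x API. It is only an illustration: the component and stream names ("es-bolt", "docs") and the nextDocument()/shardOf() helpers are hypothetical placeholders, not anything from the question.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class LocalShardAwareSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private List<Integer> localEsTasks; // ES-bolt tasks running in this worker

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        // (2) Intersect the ES-bolt's task ids with the tasks on this worker.
        localEsTasks = new ArrayList<>(context.getComponentTasks("es-bolt"));
        localEsTasks.retainAll(context.getThisWorkerTasks());
    }

    @Override
    public void nextTuple() {
        Document doc = nextDocument(); // hypothetical: poll Kafka here
        if (doc == null) {
            return;
        }
        // (3) Partition by destination shard, but only across the local tasks.
        int shard = shardOf(doc.id);
        int taskId = localEsTasks.get(shard % localEsTasks.size());
        // (1) Directed emit: we choose the consuming task ourselves.
        collector.emitDirect(taskId, "docs", new Values(doc.id, doc.body, shard));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // A stream must be declared direct (second argument true) before
        // emitDirect may be used on it.
        declarer.declareStream("docs", true, new Fields("id", "body", "destination_shard"));
    }

    // Hypothetical document model and helpers, stubbed for the sketch.
    static final class Document {
        final String id;
        final String body;
        Document(String id, String body) { this.id = id; this.body = body; }
    }

    private Document nextDocument() { return null; }

    private int shardOf(String id) {
        return Math.floorMod(id.hashCode(), 5); // stand-in for Murmur3.hash(ID) % numShards
    }
}

On the consuming side, the ES-bolt has to subscribe with a direct grouping rather than a regular one, e.g. builder.setBolt("es-bolt", new EsBolt(), 4).directGrouping("kafka-spout", "docs") (again with placeholder names), so that each tuple goes exactly to the task picked in emitDirect.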
