在运行时部署流处理拓扑? [英] Deploy stream processing topology on runtime?
问题描述
全部,
我有一个需要重新记录一些较旧数据的要求.我们有一个多阶段的管道,其来源是一个Kafka主题.一旦记录被输入,它就将经历一系列步骤(大约10个).每个步骤都会对推送到源主题的原始JSON对象进行按摩,并推送到目标主题.
I have a requirement where in I need to re-ingest some of my older data. We have a multi staged pipeline , the source of which is a Kafka topic. Once a record is fed into that, it runs through a series of steps(about 10). Each step massages the original JSON object pushed to the source topic and pushes to a destination topic.
现在,有时候,我们需要重新摄取较旧的数据,并应用我上面描述的步骤的子集.我们打算将这些记录最多的记录推送到另一个主题,以免阻止正在通过的实时"数据,这可能意味着我可能只需要应用上述10个步骤中的1个步骤即可.从上方在整个管道中运行它非常浪费,因为每个步骤都占用大量资源,并需要多个外部服务.另外,我可能需要一次重新记录数百万个条目,因此可能会阻塞我的外部服务.最后,这些重新安排活动并不那么频繁,有时可能每周仅发生一次.
Now, sometimes, we need to re ingest the older data and apply a subset of the steps I described above. We intend to push these re-ingest records to a different topic, so as not to block the "live" data that's coming through, This might mean I may need to apply just 1 step from the 10 I described above. Running it through the entire pipeline from above is wasteful, as each step is pretty resource intensive and calls multiple external services. Also, I might need to re-ingest millions of entries at time, so I might choke my external services. Lastly, these re ingsetion activities are not that frequent and may happen just once a week at times.
让我们说一下是否能够找出需要执行的步骤.这可以通过基本规则引擎来完成.完成此操作后,我需要能够动态创建拓扑/能够部署拓扑,并从新创建的主题开始进行处理.同样,我想在运行时上进行部署的原因是,尽管这些活动对业务至关重要,但它们并不会那么频繁地发生.而且每次,我需要执行的步骤可能都会更改,因此我们不能总是让整个管道都在运行.
Let's say if I am able to figure out what steps from above I need to execute. That can be done via a basic rules engine. Once that has been done, then I need to be able to dynamically create a topology/ be able to deploy it which starts processing from the newly created topic. Again, the reason I want to deploy on runtime is that these activities, albeit business critical, don't happen that frequently. And every time, the steps I need to execute might change, so we can't always have the entire pipeline running.
有没有办法做到这一点?还是我什至在朝着正确的方向思考,即我上面概述的方法是否正确?任何指针都会有所帮助.
Is there a way to achieve this? Or may be am I even thinking in the right direction i.e is the approach I outlined above is even correct? Any pointers would be helpful.
推荐答案
我建议您为重新摄取的数据创建这些动态拓扑,作为单独的Kafka Streams应用程序.而且,如果您想以编程方式即时创建此类应用程序并在完成后将其终止,请考虑以下方法:
I'd suggest you creating those dynamic topologies for re-ingested data as a separate Kafka Streams application. And if you want to programmatically create such applications on-the-fly and terminate them when done consider the following ways:
- 使每个步骤都是可配置的:您可能会传入一个旋钮参数列表,并根据它们来动态创建最复杂的拓扑.
- 如果要自动触发此类重新摄入管道,请考虑使用一些集成的部署工具来调用KafkaStreams#start.
这篇关于在运行时部署流处理拓扑?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!