Deploy stream processing topology on runtime?


Question

Hi all,

I have a requirement wherein I need to re-ingest some of my older data. We have a multi-stage pipeline whose source is a Kafka topic. Once a record is fed into it, it runs through a series of steps (about 10). Each step massages the original JSON object pushed to the source topic and pushes the result to a destination topic.

Now, sometimes, we need to re-ingest the older data and apply only a subset of the steps described above. We intend to push these re-ingested records to a different topic, so as not to block the "live" data coming through. This might mean I need to apply just 1 of the 10 steps described above. Running the data through the entire pipeline is wasteful, as each step is pretty resource-intensive and calls multiple external services. Also, I might need to re-ingest millions of entries at a time, so I might choke my external services. Lastly, these re-ingestion activities are not that frequent and may happen just once a week at times.

Let's say I am able to figure out which of the steps above I need to execute; that can be done via a basic rules engine. Once that is done, I need to be able to dynamically create a topology and deploy it, so that it starts processing from the newly created topic. Again, the reason I want to deploy at runtime is that these activities, albeit business-critical, don't happen that frequently. And the steps I need to execute might change every time, so we can't always have the entire pipeline running.
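The step-selection idea could be sketched like this (plain Java, with hypothetical step names; the real steps would be the pipeline's own resource-intensive transformations):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class StepSelector {
    // Hypothetical registry of pipeline steps, keyed by name. In the real
    // pipeline each entry would be one of the ~10 heavyweight stages.
    static final Map<String, UnaryOperator<String>> STEPS = new LinkedHashMap<>();
    static {
        STEPS.put("normalize", json -> json.trim());
        STEPS.put("enrich",    json -> json.replace("}", ",\"enriched\":true}"));
    }

    // A basic rules engine would produce the list of wanted step names;
    // here we just look up the matching transformations in pipeline order.
    static List<UnaryOperator<String>> select(List<String> wanted) {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        for (String name : wanted) {
            UnaryOperator<String> step = STEPS.get(name);
            if (step == null) throw new IllegalArgumentException("unknown step: " + name);
            chain.add(step);
        }
        return chain;
    }

    // Apply only the selected subset of steps to one record.
    static String apply(String json, List<UnaryOperator<String>> chain) {
        for (UnaryOperator<String> step : chain) json = step.apply(json);
        return json;
    }

    public static void main(String[] args) {
        System.out.println(apply("{\"id\":1}", select(List.of("enrich"))));
        // prints {"id":1,"enriched":true}
    }
}
```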

Is there a way to achieve this? Or am I even thinking in the right direction, i.e. is the approach I outlined above even correct? Any pointers would be helpful.

Answer

I'd suggest creating those dynamic topologies for re-ingested data as a separate Kafka Streams application. If you want to programmatically create such applications on the fly and terminate them when done, consider the following:

  1. Make each of your steps configurable: you could pass in a list of knob parameters and dynamically create the re-ingestion topology based on them.
  2. If you want to trigger such re-ingestion pipelines automatically, consider using some integrated deployment tooling to call KafkaStreams#start.
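To make the create/start/terminate lifecycle concrete, here is a rough sketch. `ReingestApp` is a hypothetical wrapper: in a real deployment its constructor would build a Kafka Streams `Topology` from the chosen steps, and `start()`/`close()` would delegate to `KafkaStreams#start` and `KafkaStreams#close`; processing is simulated in-process to keep the sketch self-contained.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical wrapper around a dynamically created streams application.
// Real version: constructor builds a Topology from the selected steps,
// start() calls KafkaStreams#start, close() calls KafkaStreams#close.
class ReingestApp implements AutoCloseable {
    private final List<UnaryOperator<String>> chain;
    private boolean running;

    ReingestApp(List<UnaryOperator<String>> chain) {
        this.chain = chain;
    }

    void start() {            // real version: new KafkaStreams(topology, props).start()
        running = true;
    }

    @Override
    public void close() {     // real version: KafkaStreams#close, releasing all resources
        running = false;
    }

    // Simulates one record flowing from the re-ingestion topic through the
    // selected subset of steps on its way to the destination topic.
    String process(String record) {
        if (!running) throw new IllegalStateException("application not started");
        for (UnaryOperator<String> step : chain) record = step.apply(record);
        return record;
    }
}
```

Since re-ingestion runs are infrequent, the application would exist only for the duration of a run: create it, start it, drain the re-ingestion topic, then close it so the resource-heavy steps aren't running the rest of the week.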

