Can I run a Kafka Streams application on the same machine as a Kafka broker?


Question

I have a Kafka Streams application that reads data from a few topics, joins it, and writes the result to another topic.

Kafka configuration:

5 Kafka brokers
Kafka topics: 15 partitions each, replication factor 3

Note: I am running the Kafka Streams application on the same machines where the Kafka brokers are running.

A few million records are consumed/produced every hour. Whenever I take any Kafka broker down, the application goes into rebalancing, which takes approximately 30 minutes or sometimes even longer, and it often kills many of the Kafka Streams processes.

Recommended answer

To answer the question in the title:

Coming from a Spark/HDFS background, I think this requires a change of thinking, since you are used to the idea that it is good to run your processing where your data is, to take advantage of data locality. Here, the broker holds the data, but it still has to send that data to the Kafka Streams cluster for processing, so some of the locality benefit is lost. However, keeping them separate allows you to manage the two clusters independently.

Think of a cluster that runs high-latency processing jobs and shares data and processing (e.g. an HDFS + YARN cluster): there you can bring "the process to where the data is" rather than the opposite. You can allocate resources for your data processing, but the idea is that your processing does not depend on temporary data spikes (as it does with streaming) but on the total data volume. If your data grows, your calculations will take longer and you can allocate more resources, but storage and processing grow together. In a streaming application, however, the necessary processing power depends on data spikes (and on your low-latency requirements) and not on the total data volume, so it makes sense to dimension and manage storage and processing separately, since their elasticity demands are not driven by the same dimension.
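As a rough illustration of why the two dimensions scale differently, here is a minimal Python sketch. Only the replication factor and the "few million records per hour" rate come from the question; the record size, peak factor, retention period, and per-instance throughput are hypothetical values chosen for the example.

```python
import math

# Hypothetical workload figures, chosen only for illustration.
RECORD_SIZE_BYTES = 1_000             # assumed average record size
AVG_RECORDS_PER_HOUR = 3_000_000      # "a few million records per hour"
PEAK_FACTOR = 5                       # assumed ratio of peak to average rate
RETENTION_HOURS = 7 * 24              # assumed retention period
REPLICATION_FACTOR = 3                # from the question
PER_INSTANCE_RECORDS_PER_SEC = 1_500  # assumed throughput of one Streams instance


def storage_bytes(avg_records_per_hour, retention_hours, record_size, replication):
    """Broker storage follows the total retained volume, not the spikes."""
    return avg_records_per_hour * retention_hours * record_size * replication


def streams_instances(avg_records_per_hour, peak_factor, per_instance_rate):
    """Processing capacity follows the peak rate, not the retained volume."""
    peak_rate_per_sec = avg_records_per_hour / 3600 * peak_factor
    return math.ceil(peak_rate_per_sec / per_instance_rate)


if __name__ == "__main__":
    print(storage_bytes(AVG_RECORDS_PER_HOUR, RETENTION_HOURS,
                        RECORD_SIZE_BYTES, REPLICATION_FACTOR))
    print(streams_instances(AVG_RECORDS_PER_HOUR, PEAK_FACTOR,
                            PER_INSTANCE_RECORDS_PER_SEC))
```

Under these assumptions, doubling the retention period doubles the storage requirement without changing the number of Streams instances, while doubling the peak factor does the opposite.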

This is in addition to the obvious fact that running both data handling (the Kafka broker) and data processing (Kafka Streams) on the same node puts more load on that node, but we are assuming here that this has been taken into account when dimensioning your nodes.
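One way to sanity-check that dimensioning is a simple budget calculation per node. The sketch below is only an illustration: the `Footprint` type, the `fits` helper, the 20% headroom, and all CPU and memory figures are hypothetical and would need to be replaced with measurements from your own brokers and Streams instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """CPU and memory used by a workload, or offered by a node."""
    cpu_cores: float
    memory_gb: float


def fits(node, workloads, headroom=0.2):
    """Return True if all workloads fit on the node with `headroom` left free."""
    usable_cpu = node.cpu_cores * (1 - headroom)
    usable_mem = node.memory_gb * (1 - headroom)
    total_cpu = sum(w.cpu_cores for w in workloads)
    total_mem = sum(w.memory_gb for w in workloads)
    return total_cpu <= usable_cpu and total_mem <= usable_mem


# Hypothetical figures for one machine and its two co-located processes.
node = Footprint(cpu_cores=16, memory_gb=64)
broker = Footprint(cpu_cores=6, memory_gb=32)
streams_normal = Footprint(cpu_cores=6, memory_gb=16)
# Assumed footprint after this instance takes over tasks from a failed peer.
streams_failover = Footprint(cpu_cores=8, memory_gb=20)

if __name__ == "__main__":
    print(fits(node, [broker, streams_normal]))
    print(fits(node, [broker, streams_failover]))
```

With these made-up numbers the node has room for both processes in normal operation but not once the Streams instance has to absorb extra tasks, which is the kind of case the dimensioning needs to cover when both run on the same machine.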
