Kafka Topology Best Practice


Problem Description

I have 4 machines on which a Kafka cluster is configured, with a topology where each machine runs one ZooKeeper node and two brokers.

With this configuration, what maximum number of topics and partitions would you advise for best performance?

Replication factor is 3, using Kafka 0.10.XX.

Thanks!

Answer

Each topic is restricted to 100,000 partitions no matter how many nodes (as of July 2017)

As for the number of topics, that depends on how much RAM the smallest machine has. This is because ZooKeeper keeps everything in memory for quick access (and it does not shard znodes; it only replicates them across ZK nodes on write). Effectively, once you exhaust one machine's memory, ZooKeeper will fail to add more topics. In practice, you will most likely run out of file handles on the Kafka broker nodes before reaching this limit.
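The memory argument above can be sketched as a rough back-of-envelope estimate. The per-znode footprint and fixed znode count below are illustrative assumptions, not real ZooKeeper or Kafka constants:

```python
# Rough estimate of how many topics fit before the smallest ZooKeeper
# node exhausts its RAM, since ZK keeps the whole data tree in memory.
# Both constants are illustrative assumptions, not real ZK figures.
BYTES_PER_ZNODE = 1024    # assumed average znode footprint (data + overhead)
ZNODES_PER_TOPIC = 4      # assumed fixed znodes per topic (config, state, ...)

def max_topics(ram_bytes: int, partitions_per_topic: int) -> int:
    """Topics that fit if every partition also costs one znode."""
    znodes_per_topic = ZNODES_PER_TOPIC + partitions_per_topic
    return ram_bytes // (znodes_per_topic * BYTES_PER_ZNODE)

# With 4 GiB on the smallest ZK node and 10 partitions per topic:
print(max_topics(4 * 1024**3, 10))
```

Note how the budget is set by the smallest node: ZooKeeper replicates rather than shards, so adding ZK nodes does not raise this ceiling, and more partitions per topic shrink it linearly.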

To quote the Kafka docs (6.1 Basic Kafka Operations, https://kafka.apache.org/documentation/#basic_ops_add_topic):

Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
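The arithmetic in that quote can be checked directly. The helper below is a hypothetical illustration of the naming scheme, not Kafka's actual directory-naming code:

```python
# Why topic names are capped at 249 characters: a partition's log
# directory is named "<topic>-<partition>", folder names are limited
# to 255 characters, and partition ids are assumed to stay below
# 100,000 (at most 5 digits).
MAX_FOLDER_NAME = 255
DASH = 1
MAX_PARTITION_DIGITS = 5   # 99,999 is the largest assumed partition id

MAX_TOPIC_NAME = MAX_FOLDER_NAME - DASH - MAX_PARTITION_DIGITS  # 249

def log_dir_name(topic: str, partition: int) -> str:
    """Build the per-partition log folder name, enforcing the cap."""
    if len(topic) > MAX_TOPIC_NAME:
        raise ValueError(f"topic name longer than {MAX_TOPIC_NAME} chars")
    return f"{topic}-{partition}"

print(log_dir_name("orders", 3))   # prints "orders-3"
```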

To quote the Zookeeper docs (https://zookeeper.apache.org/doc/trunk/zookeeperOver.html):

The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.

Performance:

Depending on your publishing and consumption semantics, the appropriate number of topics and partitions will change. The following is a set of questions you should ask yourself to gain insight into a potential solution (your question is very open-ended):

  • Is the data I am publishing mission-critical (i.e. it cannot be lost, I must be sure it was published, and it must be consumed exactly once)?
  • Should I make the producer.send() call as synchronous as possible, or continue to use the asynchronous method with batching (do I trade publishing guarantees for speed)?
  • Are the messages I am publishing dependent on one another? Does message A have to be consumed before message B (implying A is published before B)?
  • How do I choose which partition to send a message to? Should I assign the message to a partition explicitly (extra producer logic), let the cluster decide in a round-robin fashion, or assign a key that hashes to one of the topic's partitions (this requires an evenly distributed hash to get good load balancing across partitions)?
  • How many topics should you have? How is this connected to the semantics of your data? Will auto-creating topics for many distinct logical data domains be efficient (think of the effect on Zookeeper and the administrative pain of deleting stale topics)?
  • Partitions provide parallelism (more consumers are possible) and can improve load balancing (if the producer publishes correctly). Would you want to assign parts of your problem domain to specific partitions (e.g. when publishing, send data for client A to partition 1)? What side effects does this have (think of refactorability and maintainability)?
  • Will you want to create more partitions than you currently need so you can scale up later with more brokers/consumers? How realistic is automatic scaling of a Kafka cluster given your expertise? Can it be done manually? Is manual scaling viable for your problem domain (are you building Kafka around a fixed system with well-known characteristics, or must you be able to handle severe spikes in messages)?
  • How will my consumers subscribe to topics? Will they use pre-configured subscriptions, or a regex to consume many topics? Are the messages between topics dependent or prioritized (priority requires extra logic on the consumer)?
  • Should you use different network interfaces for replication between brokers (e.g. port 9092 for producers/consumers and 9093 for replication traffic)?
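On the partition-selection question above: keyed messages are routed by hashing the key modulo the partition count. The sketch below uses crc32 as a stand-in hash to show the pattern (Kafka's Java client actually uses murmur2, so these are not the real partition numbers):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Hash a message key to a partition (crc32 as a stand-in hash)."""
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition, which is what
# preserves per-key ordering; distinct keys spread across partitions.
assignments = {k: partition_for(k, 8) for k in (b"client-A", b"client-B", b"client-C")}
print(assignments)
```

The caveat from the question applies: keys must be well distributed, or some partitions will run hot while others sit idle.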

Good links:

http://cloudurable.com/ppt/4-kafka-detailed-architecture.pdf
https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
https://kafka.apache.org/documentation/

