Kafka on Kubernetes multi-node


Question

So my objective here is to set up a cluster of several Kafka brokers in a distributed fashion. But I can't see a way to make the brokers aware of each other.

As far as I understand, every broker needs a distinct ID in its config, which I cannot guarantee or configure if I launch the containers from Kubernetes?

Do they also need to have the same advertised_host?

Are there any parameters I'm missing that would need to be changed for the nodes to discover each other?

Would it be viable to do such a configuration at the end of the Dockerfile with a script? And/or a shared volume?

I'm currently trying to do this with the spotify/kafka image, which has a preconfigured ZooKeeper + Kafka combination, on vanilla Kubernetes.

Solution

My solution for this has been to use the IP as the ID: trim the dots and you get a unique ID that is also available outside of the container to other containers.
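
For example, here is a minimal sketch of deriving the broker ID this way in a startup script. The server.properties path is an assumption; adjust it for whatever Kafka image you use:

  # Derive a unique broker.id from the pod IP by stripping the dots,
  # e.g. 10.244.1.17 -> 10244117 (hostname -i prints the pod's IP here)
  BROKER_ID=$(hostname -i | tr -d '.')
  # Assumed config path; adjust for your Kafka image
  sed -i "s/^broker.id=.*/broker.id=$BROKER_ID/" /opt/kafka/config/server.properties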

With a Service you can get access to the multiple containers' IPs (see my answer on how to do this here: What's the best way to let Kubernetes pods communicate with each other?).

So you can get their IDs too if you use the IP as the unique ID. The only issue is that the IDs are not contiguous and don't start at 0, but ZooKeeper/Kafka don't seem to mind.

EDIT 1:

The follow-up concerns configuring ZooKeeper:

Each ZK node needs to know of the other nodes. The Kubernetes discovery service knows of the nodes that are within a Service, so the idea is to start a Service with the ZK nodes.

This Service needs to be started BEFORE creating the ReplicationController (RC) of the Zookeeper pods.

The start-up script of the ZK container will then need to:

  • wait for the discovery service to populate the ZK Service with its nodes (that takes a few seconds; for now I just add a sleep 10 at the beginning of my startup script, but more reliably you should wait for the service to have at least 3 nodes in it; see the sketch below.)
  • look up the containers forming the Service in the discovery service: this is done by querying the API. The KUBERNETES_SERVICE_HOST environment variable is available in each container. The endpoint for the service description is then

URL="http(s)://$USERNAME:$PASSWORD@${KUBERNETES_SERVICE_HOST/api/v1/namespaces/${NAMESPACE}/endpoints/${SERVICE_NAME}"

where NAMESPACE is default unless you changed it, and SERVICE_NAME would be zookeeper if you named your service zookeeper.

There you get the description of the containers forming the Service, with their IP in an "ip" field. You can do:

curl -s $URL | grep '\"ip\"' | awk '{print $2}' | awk -F\" '{print $2}' 

to get the list of IPs in the Service. With that, populate the zoo.cfg on the node using the IDs defined above.
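
Putting this together, a minimal sketch of that part of a startup script might be (the zoo.cfg path is an assumption; 2888/3888 are the standard ZooKeeper peer and election ports):

  # Wait until the Service reports at least 3 endpoints (see the first bullet above)
  while [ "$(curl -s --insecure "$URL" | grep -c '"ip"')" -lt 3 ]; do sleep 2; done
  # Collect the IPs and write one server.ID entry per node
  IPS=$(curl -s --insecure "$URL" | grep '"ip"' | awk -F'"' '{print $4}')
  for ip in $IPS; do
    id=$(echo "$ip" | tr -d '.')    # same dot-stripped ID as above
    echo "server.$id=$ip:2888:3888" >> /opt/zookeeper/conf/zoo.cfg
  done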

You might need the USERNAME and PASSWORD to reach the endpoint on services like google container engine. These need to be put in a Secret volume (see doc here: http://kubernetes.io/v1.0/docs/user-guide/secrets.html )

You would also need to use curl -s --insecure on Google Container Engine unless you go through the trouble of adding the CA cert to your pods

Basically, add the volume to the container and look up the values from the file. (Contrary to what the doc says, DO NOT put a \n at the end of the username or password when base64-encoding them: it just makes your life more complicated when reading them back.)
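
For instance, assuming the Secret is mounted at /etc/secret-volume with keys named username and password (both names illustrative):

  # Read the API credentials from the mounted Secret volume
  USERNAME=$(cat /etc/secret-volume/username)
  PASSWORD=$(cat /etc/secret-volume/password)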

EDIT 2:

Another thing you'll need to do on the Kafka nodes is get the IPs and hostnames and put them in the /etc/hosts file. Kafka seems to need to know the nodes by hostname, and these are not set within service nodes by default.
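
A sketch of that step, reusing the endpoints data from above (the hostname is assumed to equal the pod name, as EDIT 4 below explains):

  # Append "IP hostname" lines so Kafka can resolve the other brokers;
  # $ip and $host would be extracted from the endpoints API response
  echo "$ip $host" >> /etc/hosts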

EDIT 3:

After much trial and thought, using the IP as an ID may not be so great: it depends on how you configure storage. For any kind of distributed service like ZooKeeper, Kafka, Mongo, or HDFS, you might want to use the emptyDir type of storage, so the data lives just on that node (mounting remote storage kind of defeats the purpose of distributing these services!).

An emptyDir will reload with the data on the same node, so it seems more logical to use the NODE ID (the node IP) as the ID, because then a pod that restarts on the same node will have its data. That avoids potential corruption of the data (if a new node starts writing in the same dir that is not actually empty, who knows what can happen). It also matters for Kafka: topics are assigned a broker.id, and if the broker ID changes, ZooKeeper does not update the topic's broker.id, so the topic looks like it is available BUT points to the wrong broker.id, and it's a mess.

So far I have yet to find how to get the node IP, but I think it's possible to look it up in the API by finding the Service's pod names and then the nodes they are deployed on.

EDIT 4

To get the node IP, you can get the pod hostname == name from the endpoints API /api/v1/namespaces/default/endpoints/ as explained above. Then you can get the node IP from the pod name with /api/v1/namespaces/default/pods/.
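
A hedged sketch of that lookup (hostIP is the node's IP in the pod's status in the v1 API; POD_NAME would be the name obtained from the endpoints query):

  # Fetch the pod object and pull the node IP from its "hostIP" field
  POD_URL="https://$USERNAME:$PASSWORD@${KUBERNETES_SERVICE_HOST}/api/v1/namespaces/default/pods/${POD_NAME}"
  NODE_IP=$(curl -s --insecure "$POD_URL" | grep '"hostIP"' | awk -F'"' '{print $4}')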

PS: this is inspired by the example in the Kubernetes repo (rethinkdb example here: https://github.com/kubernetes/kubernetes/tree/master/examples/rethinkdb ).
