Kafka on Kubernetes multi-node


Problem Description

So my objective here is to set up a cluster of several Kafka brokers in a distributed fashion, but I can't see a way to make the brokers aware of each other.

As far as I understand, every broker needs a separate ID in its config, which I cannot guarantee or configure if I launch the containers from Kubernetes?

They also need to have the same advertised_host?

Are there any parameters I'm missing that would need to be changed for the nodes to discover each other?

Would it be viable to do such a configuration at the end of the Dockerfile with a script? And/or a shared volume?

I'm currently trying to do this with the spotify/kafka image, which has a preconfigured ZooKeeper + Kafka combination, on vanilla Kubernetes.

Solution

My solution for this has been to use the IP as the ID: trim the dots and you get a unique ID that is also available outside of the container to other containers.
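For illustration, a minimal sketch of what that looks like in a broker's startup script (the config path and the use of hostname -i are my assumptions, not details from the original setup):

# Derive a numeric broker ID from the pod's own IP by trimming the dots
MY_IP=$(hostname -i)                    # e.g. 10.244.1.17
BROKER_ID=$(echo $MY_IP | tr -d '.')    # -> 10244117
sed -i "s/^broker\.id=.*/broker.id=${BROKER_ID}/" /opt/kafka/config/server.properties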

With a Service you can get access to the containers' IPs (see my answer here on how to do this: What's the best way to let Kubernetes pods communicate with each other?).

So you can get their IDs too if you use IPs as the unique ID. The only issue is that the IDs are not continuous and don't start at 0, but ZooKeeper/Kafka don't seem to mind.

EDIT 1:

The follow-up concerns configuring ZooKeeper:

Each ZK node needs to know of the other nodes. The Kubernetes discovery service knows of the nodes that are within a Service, so the idea is to start a Service with the ZK nodes.

This Service needs to be started BEFORE creating the ReplicationController (RC) of the Zookeeper pods.
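As a sketch of that ordering (the manifest filenames are assumptions):

# Create the Service first so its endpoints exist when the pods come up
kubectl create -f zookeeper-service.yaml
# ...then the ReplicationController that starts the ZK pods
kubectl create -f zookeeper-rc.yaml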

The start-up script of the ZK container will then need to:

  • wait for the discovery service to populate the ZK Service with its nodes (that takes a few seconds; for now I just add a sleep 10 at the beginning of my startup script, but more reliably you should wait for the service to have at least 3 nodes in it, as in the sketch after the curl command below).
  • look up the containers forming the Service in the discovery service: this is done by querying the API. The KUBERNETES_SERVICE_HOST environment variable is available in each container. The endpoint to find the service description is then

URL="http(s)://$USERNAME:$PASSWORD@${KUBERNETES_SERVICE_HOST}/api/v1/namespaces/${NAMESPACE}/endpoints/${SERVICE_NAME}"

where NAMESPACE is default unless you changed it, and SERVICE_NAME would be zookeeper if you named your service zookeeper.

There you get the description of the containers forming the Service, with their IPs in an "ip" field. You can do:

curl -s $URL | grep '\"ip\"' | awk '{print $2}' | awk -F\" '{print $2}' 

to get the list of IPs in the Service. With that, populate the zoo.cfg on the node using the ID defined above.
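Putting the two steps together, a hedged sketch of that part of the startup script (the paths, ports, and use of hostname -i are assumptions; the trimmed-IP ID scheme is the one defined above):

# Wait until the Service reports at least 3 endpoints
while [ $(curl -s $URL | grep -c '"ip"') -lt 3 ]; do sleep 2; done

# Build the server list in zoo.cfg, using each node's trimmed IP as its ID
MY_IP=$(hostname -i)
for ip in $(curl -s $URL | grep '"ip"' | awk '{print $2}' | awk -F\" '{print $2}'); do
    id=$(echo $ip | tr -d '.')
    echo "server.${id}=${ip}:2888:3888" >> /opt/zookeeper/conf/zoo.cfg
    # record this node's own ID in the myid file
    [ "$ip" = "$MY_IP" ] && echo $id > /var/lib/zookeeper/myid
done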

You might need the USERNAME and PASSWORD to reach the endpoint on services like Google Container Engine. These need to be put in a Secret volume (see the docs here: http://kubernetes.io/v1.0/docs/user-guide/secrets.html).

You would also need to use curl -s --insecure on Google Container Engine, unless you go through the trouble of adding the CA cert to your pods.
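Both variants, sketched (the CA cert path is an assumption about where you mount it):

# Skip certificate verification...
curl -s --insecure $URL
# ...or verify against a CA cert you added to the pod
curl -s --cacert /etc/secret-volume/ca.crt $URL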

Basically, add the volume to the container and look up the values from the files. (Contrary to what the doc says, DO NOT put a \n at the end of the username or password when base64 encoding: it just makes your life more complicated when reading them.)
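A minimal sketch of reading those values, assuming the Secret is mounted at /etc/secret-volume with username and password keys:

# Read the API credentials from the mounted Secret volume
USERNAME=$(cat /etc/secret-volume/username)
PASSWORD=$(cat /etc/secret-volume/password)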

EDIT 2:

Another thing you'll need to do on the Kafka nodes is get the IPs and hostnames and put them in the /etc/hosts file. Kafka seems to need to know the nodes by hostname, and these are not set within service nodes by default.
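For instance, a sketch of that step, assuming PEERS already holds "ip hostname" pairs gathered from the endpoints API:

# Append each peer's IP and hostname so Kafka can resolve the other brokers
echo "$PEERS" | while read ip host; do
    echo "$ip $host" >> /etc/hosts
done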

EDIT 3:

After much trial and thought, using the IP as an ID may not be so great: it depends on how you configure storage. For any kind of distributed service like ZooKeeper, Kafka, Mongo, or HDFS, you might want to use the emptyDir type of storage, so the data lives only on that node (mounting a remote storage kind of defeats the purpose of distributing these services!). An emptyDir keeps the data on that node, so a pod that restarts on the same node will still have its data; it therefore seems more logical to use the NODE ID (the node IP) as the ID. That avoids potential corruption of the data (if a new pod starts writing in a dir that is not actually empty, who knows what can happen). It also matters with Kafka: topics are assigned a broker.id, and if the broker ID changes, ZooKeeper does not update the topic's broker.id; the topic looks like it is available BUT points to the wrong broker.id, and it's a mess.

So far I have yet to find how to get the node IP, but I think it's possible to look it up in the API, by finding the service's pod names and then the nodes they are deployed on.

EDIT 4:

To get the node IP, you can get the pod hostname == name from the endpoints API /api/v1/namespaces/default/endpoints/ as explained above. Then you can get the node IP from the pod name with /api/v1/namespaces/default/pods/
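A sketch of that lookup, following the same grep/awk style as above (the hostIP field in the pod's status holds the node's IP; treating the pod name as the hostname is the assumption stated earlier):

# Look up this pod's own object and extract the IP of the node it runs on
# (use http or https to match your cluster, as with $URL above)
POD_NAME=$(hostname)
POD_URL="https://$USERNAME:$PASSWORD@${KUBERNETES_SERVICE_HOST}/api/v1/namespaces/default/pods/${POD_NAME}"
NODE_IP=$(curl -s --insecure $POD_URL | grep '"hostIP"' | awk -F\" '{print $4}')
NODE_ID=$(echo $NODE_IP | tr -d '.')   # node IP trimmed of dots, used as the stable ID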

PS: this is inspired by the example in the Kubernetes repo (the rethinkdb example here: https://github.com/kubernetes/kubernetes/tree/master/examples/rethinkdb).

