Docker Swarm not scheduling containers when node dies


Problem Description

EDIT

NEVER MIND THIS QUESTION. I found that one of my services, which uses Docker.DotNet, was terminating the services marked as Shutdown. I've corrected the bug and have regained my trust in Docker and Docker Swarm. Thank you Carlos for your help. My bad, my fault. Sorry for that!
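(For context on the bug described above: when a node is drained, Swarm keeps the stopped tasks around as history with desired state shutdown. They can be listed with the command below, using cords_frontend from the later edit as an example service name. An external tool that treats these historical task entries as services to delete would end up removing healthy services.)

docker service ps --filter "desired-state=shutdown" cords_frontend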

I have 13 services configured in a docker-compose file, running in Swarm mode with one manager and two worker nodes.

Then I make one of the worker nodes unavailable by draining it

docker node update --availability drain ****-v3-6by7ddst

What I notice is that all the services that were running on the drained node are removed and not rescheduled to the available nodes. The available worker and manager nodes still have plenty of resources. The services are simply removed. I am now down to 9 services.
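(A quick way to confirm this is to compare the replica counts and task placement after the drain; cords_frontend is used as an example service name, taken from the later edit:)

docker node ls
docker service ls
docker service ps --no-trunc cords_frontend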

Looking at the logs I see entries like the ones below, repeated with different service IDs:

level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
level=error msg="Error getting service u68b1fofzb3nefpnasctpywav: service u68b1fofzb3nefpnasctpywav not found"
level=warning msg="rmServiceBinding 021eda460c5744fd4d499475e5aa0f1cfbe5df479e5b21389ed1b501a93b47e1 possible transient state ok:false entries:0 set:false "

Then, for debugging purposes, I set my node back to available:

docker node update --availability active ****-v3-6by7ddst

Then I try to balance some of the services to the newly available node. And this is the result.
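(Presumably the balancing is done with a forced update, which is the usual way to make Swarm reschedule a service's tasks onto a newly available node, and is also what the later edit does; a sketch, again with cords_frontend as the example service name:)

docker service update --force cords_frontend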

I get the same errors in the logs:

level=error msg="Error getting service ****_frontend: service ****_frontend not found"
level=warning msg="rmServiceBinding 6bb220c0a95b30cdb3ff7b577c7e9dec7ad6383b34aff85e1685e94e7486e3ea possible transient state ok:false entries:0 set:false "
msg="Error getting service l29wlucttul75pzqo2sgr0u9e: service l29wlucttul75pzqo2sgr0u9e not found"

In my docker-compose file I configure all my services like this, with the restart policy set to any:

  frontend:
    image: ${FRONTEND_IMAGE}
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.docker.lbswarm=true"
        - "traefik.http.routers.frontend.rule=Host(`${FRONTEND_HOST}`)"
        - "traefik.http.routers.frontend.entrypoints=websecure"
        - "traefik.http.routers.frontend.tls.certresolver=myhttpchallenge"
        - "traefik.http.services.frontend.loadbalancer.server.port=80"
        - "traefik.docker.network=ingress"
      replicas: 1
      resources:
        limits:
          memory: ${FRONTEND_LIMITS_MEMORY}
          cpus: ${FRONTEND_LIMITS_CPUS}
        reservations:
          memory: ${FRONTEND_RESERVATION_MEMORY}
          cpus: ${FRONTEND_RESERVATION_CPUS}
      restart_policy:
        condition: any
    networks:
      - ingress
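(One way to double-check what those variables resolve to once the stack is deployed is to read the effective limits and reservations back from the service spec; cords_frontend is assumed as the deployed service name:)

docker service inspect --format '{{json .Spec.TaskTemplate.Resources}}' cords_frontend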

Something fails while recreating services on different nodes, and even with only one manager/worker node I get the same result.

The rest seems to work fine. As an example, if I scale a service it works well.
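(For example, a scale operation such as the following, with cords_frontend assumed as the deployed service name, completes without any of the errors above:)

docker service scale cords_frontend=4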

New Edit

Just did another test.

  • This time I only have two services, traefik and front-end.
  • One instance for traefik
  • 4 instances for front-end
  • two nodes (one manager and one worker)
  • Drained the worker node; the front-end instances running on it were moved to the manager node
  • Activated the worker node again
  • Did a docker service update cords_frontend --force, and two front-end instances were killed on the manager node and placed running on the worker node

So, in this test with only two services, everything works fine.

Is there any kind of limit to the number of services a stack should have?

Any clues why this is happening?

Thanks

Hugo

Solution

I believe you may be running into an issue with resource reservations. You mention that the available nodes have plenty of resources, but the way reservations work, a service will not be scheduled if it can't reserve the resources specified. It is very important to note that this has nothing to do with how many resources the service is actually using. If you specify a reservation, you are basically saying that the service will reserve that amount of resources, and those resources are not available for other services to use. So if all your services have similar reservations, you may be running into a situation where, even though a node shows available resources, those resources are in fact reserved by the existing services. I would suggest you remove the reservations section and try again, to see if that is in fact what is happening.
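A minimal sketch of what this suggestion looks like in the question's compose file (labels omitted for brevity): keep the limits, drop the reservations block, so the scheduler no longer needs to find a node with that much unreserved memory and CPU before placing the task.

  frontend:
    image: ${FRONTEND_IMAGE}
    deploy:
      replicas: 1
      resources:
        limits:
          memory: ${FRONTEND_LIMITS_MEMORY}
          cpus: ${FRONTEND_LIMITS_CPUS}
        # reservations removed for the test suggested above
      restart_policy:
        condition: any
    networks:
      - ingress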
