Pros and Cons of running all Docker Swarm nodes as Managers?

Question

I am considering building out a Docker Swarm cluster. For the purpose of keeping things both simple and relatively fault-tolerant, I thought about simply running 3 nodes as managers.

What are the trade-offs when not using any dedicated worker nodes? Is there anything I should be aware of that might not be obvious?

I found this GitHub issue, which asks a similar question, but the answer is a bit ambiguous to me. It mentions the performance may be worse. It also mentions that it will take longer to reach consensus. In practice, what functionality would be slower? And what does "take longer to reach consensus" actually affect?

Recommended Answer

TL;DR: Pros and cons of all managers also being workers in a Swarm:

Pros:

  • Prod-quality HA with only 3 or 5 servers
  • Simplicity of design/management
  • Still secure by default (secrets are encrypted on disk, mutual TLS auth and network encryption on control plane)
  • Any node can administrate the Swarm

Cons:

  • Requires tighter management of resources to prevent manager starvation
  • Lower security posture: secrets/keys are stored on app servers
  • A compromised node means the whole Swarm could easily be compromised
  • Limited to an odd number of servers, usually 3 or 5
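
For concreteness, here is a minimal sketch of bootstrapping such a three-node, all-manager Swarm. The IP address and token are placeholders; `docker swarm init`, `docker swarm join-token`, and `docker swarm join` are the standard commands:

    # On the first node: create the Swarm
    docker swarm init --advertise-addr 10.0.0.11

    # Print the join command/token for additional *managers* (not workers)
    docker swarm join-token manager

    # On the other two nodes: join as managers with that token
    docker swarm join --token <manager-token> 10.0.0.11:2377

    # From any manager: all three nodes are managers, one is the Leader
    docker node ls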

Full answer to the questions

What are the trade-offs when not using any dedicated worker nodes? Is there anything I should be aware of that might not be obvious?

There are no hard requirements for using worker-only nodes. If you're deploying a solution where you know what resources you need, and the number of services/tasks is usually the same, there's nothing wrong with a Swarm of just three managers doing all the work, as long as you have considered the three areas that are affected:


  1. Security. In a perfect world, your managers would not be internet-accessible and would only sit on a backend subnet, doing only manager work. The managers have all the authority in the Swarm, hold all the encrypted secrets, store the encrypted Raft log, and also (by default) store the encryption keys on disk. Workers only store the secrets they need (and only in memory), and have no authority to do any work in the Swarm other than what the leader has told them to do. If a worker gets compromised, you haven't necessarily "lost the Swarm". This separation of powers is not a hard requirement, and many environments accept the risk and just use the managers as the main servers that publish services to the public. It's just a question of security/complexity vs. cost.
  2. Node count. The minimum number of managers for redundancy is 3, and 3 or 5 is what I recommend most of the time. More managers do not equal more capacity: only one manager is the leader at any time, and it is the only one doing manager work. The resource capacity of the leader determines how much work the Swarm can do simultaneously. If your managers are also doing app work and you need more resource capacity than 3 nodes can handle, then I'd recommend that the 4th node and beyond be workers only.
  3. Performance/scale. Ideally, your managers have all the resources they need to do things fast: leader election, task scheduling, running and reacting to healthchecks, etc. Their resource utilization grows with the total number of nodes, total services, and the rate of new work they have to perform (service/network creation, task changes, node changes, healthchecks, etc.). If you have a small number of servers and a small number of services/replicas, then you can likely have the managers also be workers, as long as you're careful (use resource limits on services; see the sketch after this list) to prevent your apps (especially databases) from starving the docker daemon of resources so badly that Swarm can't do its job. When you start seeing random leader changes or errors/failures, "check the managers for available resources" should be on your short list of troubleshooting steps.
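
As a hedged illustration of point 3 (and of adding worker-only capacity from point 2), here is a sketch. The service names, image tags, and limit values are hypothetical; `--limit-cpu`/`--limit-memory`/`--reserve-memory`, placement constraints, and `--availability drain` are standard Docker CLI options:

    # Cap a heavy service so it can't starve the docker daemon / Raft on managers
    docker service create --name db \
      --replicas 1 \
      --limit-cpu 1.0 --limit-memory 1G \
      --reserve-memory 512M \
      postgres:15

    # If you add a 4th+ node as a worker, pin heavy workloads to workers only
    docker service create --name batch \
      --constraint 'node.role==worker' \
      myorg/batch-job:latest

    # Or, if you later want managers to do only manager work (point 1),
    # drain them so no tasks get scheduled there
    docker node update --availability drain manager-1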

Your other questions:

In practice, what functionality would be slower? And what does "take longer to reach consensus" actually affect?

More managers = longer for the managers to elect a new leader when one goes down. While there is no leader, the Swarm is in a read-only state: new replica tasks cannot be launched and service updates won't happen. Any container that fails won't auto-recover, because the Swarm managers can't do work. Your running apps, ingress routing mesh, etc. all still function. A large part of the performance of manager health and leader election is tied to the network latency between all manager nodes, as much as it is to the number of managers. This is why Docker generally advises that a single Swarm's managers all be in the same region, so they get a low-latency round trip between each other. There is no hard-set rule here. If you test with 200ms latency between managers, test failure scenarios, and are fine with the results and speed of leader election, cool.
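
In practice you can watch leader status and manager reachability directly. The node name below is hypothetical; `docker node ls` and `docker node inspect --format` are standard commands:

    # MANAGER STATUS column shows Leader / Reachable / Unreachable
    docker node ls

    # True if this manager is the current Raft leader
    docker node inspect manager-1 --format '{{ .ManagerStatus.Leader }}'

    # This manager's reachability as seen by the Raft quorum
    docker node inspect manager-1 --format '{{ .ManagerStatus.Reachability }}'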

Background info:

  • Swarm Admin Guide
  • Laura Frank's DockerCon talk on Swarm/Raft internals and recovery
  • My DockerCon talk on Swarm production considerations/design
  • Nico Kabar's DockerCon talk on Enterprise Swarm considerations
  • (If you're going big) Running Docker EE at scale
