Swarm mode routing mesh not working, instead is working like host mode by default

Problem description

Description

Swarm mode routing mesh is not working; instead, it behaves as if the services were published in host mode by default.

We were deploying a swarm of 3 manager nodes and 8 worker nodes, each on a different instance of an OpenStack cloud, using Terraform and Ansible. The swarm and the routing mesh were working perfectly until they stopped working and started behaving as if in host mode. We didn't change anything, apply any update, or deploy any new service. We tried restarting the swarm and re-deploying the swarm and all services, but nothing worked; we couldn't make it work in routing mesh mode again. So we decided to destroy all the instances and start from scratch (the issue reported below). We did a clean installation of Ubuntu 18.04 LTS and Docker as we did before. Then we set up 1 manager node and 2 workers (this time manually) and deployed one service, but the swarm is still working as if in host mode.

The only way to access a service is via the IP address of the node where its task is running; otherwise there is no answer (timeout). We tried using the IP of the manager or of the other worker instances, but the service is not reachable that way. That is why we suspect the swarm is using host mode by default instead of the ingress network and routing mesh.

We also tried different services such as Mongo and Cassandra, but the behaviour is the same: the swarm looks like it is using host mode. You can only access a service by using the IP address of the instance where it is running.
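A quick way to confirm which publish mode a service is actually using is to inspect its endpoint. A minimal sketch (nh_blazegraph is the service from the stack deployed below; the exact JSON path may vary slightly between Docker versions):

sudo docker service inspect --format '{{json .Endpoint.Ports}}' nh_blazegraph
# routing mesh: ..."PublishMode":"ingress"...
# host mode:    ..."PublishMode":"host"... (port bound only on the node running the task)

In this report the docker service inspect output further down does show PublishMode = ingress, so the service definition itself requests the routing mesh.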

Any ideas on how to get past this host-mode behaviour and go back to the routing mesh? We want to access any service only through the IP addresses of the manager nodes, which are supposed to be in Drain mode.
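For reference, Drain only prevents tasks from being scheduled on a node; a drained manager still participates in the ingress routing mesh, so reaching services through the manager IPs should work once the mesh is fixed. A minimal sketch, assuming the manager hostname nh-manager-0 reported by docker info below:

[manager] sudo docker node update --availability drain nh-manager-0
[manager] sudo docker node ls    # the AVAILABILITY column should now show Drain for the manager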

Steps to reproduce the issue:

  1. [manager] sudo docker swarm init --advertise-addr 158.39.201.14
  2. [worker-0] sudo docker swarm join --token SWMTKN-1-3np0cy0msnfurecckl4863hkftykuqkgeq998s1hix6jsoiarq-758o52hymaiyzv74w3u1yzltt 158.39.201.14:2377
  3. [worker-1] sudo docker swarm join --token SWMTKN-1-3np0cy0msnfurecckl4863hkftykuqkgeq998s1hix6jsoiarq-758o52hymaiyzv74w3u1yzltt 158.39.201.14:2377
  4. [manager] sudo docker stack deploy -c docker-compose.yml nh

Describe the results you received:

curl http://[worker-0-ip]:8089/bigdata 200 OK

curl http://[worker-1-ip]:8089/bigdata FAIL TIMEOUT

Describe the results you expected:

curl http://[worker-0-ip]:8089/bigdata 200 OK

curl http://[worker-1-ip]:8089/bigdata 200 OK

Additional information you deem important (e.g. issue happens only occasionally):

This issue was not happening 2 days ago and suddenly it started happening. We didn't make any modification or touch the servers.

docker-compose.yml

version: '3.7'

networks:
  news-hunter:
    name: &network news-hunter

x-network: &network-base
  networks:
    - *network

services:
  blazegraph:
    <<: *network-base
    image: lyrasis/blazegraph:2.1.5
    ports:
      - published: 8089
        target: 8080
    deploy:
      placement:
        constraints:
          - node.role == worker 
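To rule out anything specific to this stack file, a throwaway service can be published explicitly in ingress mode and curled from every node. A rough sketch, assuming published port 8090 is free on all nodes and using nginx purely as a test image:

sudo docker service create --name ingress-probe \
  --publish published=8090,target=80,mode=ingress \
  nginx:alpine

# From any node, manager or worker, this should answer even if the task runs elsewhere:
curl -I http://158.39.201.14:8090/
curl -I http://[worker-0-ip]:8090/
curl -I http://[worker-1-ip]:8090/

sudo docker service rm ingress-probe

If the probe also answers only on the node where its task is running, the problem is in the environment (overlay/vxlan traffic between the hosts) rather than in the compose file.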

iptables of the manager, worker-1 and worker-2 (all identical): sudo iptables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-INGRESS  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain DOCKER (2 references)
target     prot opt source               destination

Chain DOCKER-INGRESS (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:8089
ACCEPT     tcp  --  anywhere             anywhere             state RELATED,ESTABLISHED tcp spt:8089
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-ISOLATION-STAGE-2 (2 references)
target     prot opt source               destination
DROP       all  --  anywhere             anywhere
DROP       all  --  anywhere             anywhere
RETURN     all  --  anywhere             anywhere

Chain DOCKER-USER (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Manager ports: sudo netstat -tuplen

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      101        46731      14980/systemd-resol
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      0          17752      865/sshd
tcp6       0      0 :::22                   :::*                    LISTEN      0          17757      865/sshd
tcp6       0      0 :::8089                 :::*                    LISTEN      0          306971     24992/dockerd
tcp6       0      0 :::2377                 :::*                    LISTEN      0          301970     24992/dockerd
tcp6       0      0 :::7946                 :::*                    LISTEN      0          301986     24992/dockerd
udp        0      0 127.0.0.53:53           0.0.0.0:*                           101        46730      14980/systemd-resol
udp        0      0 158.39.201.14:68        0.0.0.0:*                           100        46591      14964/systemd-netwo
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           0          302125     -
udp6       0      0 fe80::f816:3eff:fef:546 :::*                                100        46586      14964/systemd-netwo
udp6       0      0 :::7946                 :::*                                0          301987     24992/dockerd

Worker ports: sudo netstat -tuplen

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      101        44998      15283/systemd-resol
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      0          15724      1010/sshd
tcp6       0      0 :::22                   :::*                    LISTEN      0          15726      1010/sshd
tcp6       0      0 :::8089                 :::*                    LISTEN      0          300227     25355/dockerd
tcp6       0      0 :::7946                 :::*                    LISTEN      0          283636     25355/dockerd
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           0          285465     -
udp        0      0 127.0.0.53:53           0.0.0.0:*                           101        44997      15283/systemd-resol
udp        0      0 158.39.201.15:68        0.0.0.0:*                           100        233705     15247/systemd-netwo
udp6       0      0 :::7946                 :::*                                0          283637     25355/dockerd
udp6       0      0 fe80::f816:3eff:fee:546 :::*                                100        48229      15247/systemd-netwo

Services running: sudo docker service ls

ID                  NAME                MODE                REPLICAS            IMAGE                      PORTS
m7eha88ff4wm        nh_blazegraph       replicated          1/1                 lyrasis/blazegraph:2.1.5   *:8089->8080/tcp

Stack: sudo docker stack ps nh

ID                  NAME                IMAGE                      NODE                DESIRED STATE       CURRENT STATE         ERROR               PORTS
tqkd9t4i03yt        nh_blazegraph.1     lyrasis/blazegraph:2.1.5   nh-worker-0         Running             Running 3 hours ago

Output of docker version:

Client: Docker Engine - Community
 Version:           19.03.6
 API version:       1.40
 Go version:        go1.12.16
 Git commit:        369ce74a3c
 Built:             Thu Feb 13 01:27:49 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.6
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.16
  Git commit:       369ce74a3c
  Built:            Thu Feb 13 01:26:21 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 1
 Server Version: 19.03.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: hpcm67vxrmkm1wvlhfqbjevox
  Is Manager: true
  ClusterID: gnl96swlf7o3a976oarvjrazt
  Managers: 1
  Nodes: 3
  Default Address Pool: 10.0.0.0/8
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 158.39.201.14
  Manager Addresses:
   158.39.201.14:2377
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
 runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-74-generic
 Operating System: Ubuntu 18.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 3.852GiB
 Name: nh-manager-0
 ID: PHBO:E6UZ:RNJL:5LVU:OZXW:FM5M:GQVW:SCAQ:EEQW:7IIW:GARL:AUHI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Service inspect: sudo docker service inspect --pretty nh_blazegraph

ID:             ef9s5lesysovh5x2653qc6dk9
Name:           nh_blazegraph
Labels:
 com.docker.stack.image=lyrasis/blazegraph:2.1.5
 com.docker.stack.namespace=nh
Service Mode:   Replicated
 Replicas:      1
Placement:
 Constraints:   [node.role == worker]
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         lyrasis/blazegraph:2.1.5@sha256:e9fb46c9d7b2fc23202945a3d71b99ad8df2d7a18dcbcccc04cfc4f791b569e9
Resources:
Networks: news-hunter
Endpoint Mode:  vip
Ports:
 PublishedPort = 8089
  Protocol = tcp
  TargetPort = 8080
  PublishMode = ingress

Additional environment details (AWS, VirtualBox, physical, etc.):

We are working with an OpenStack IaaS cloud provider. Our workload can expect more than 1000 HTTP requests per minute from external sources and more than 5000 requests between nodes.

Cross-posted:

https://forums.docker.com/t/swarm-mode-routing-mesh-not-working-instead-is-using-host-mode-by-default/89731 https://github.com/moby/moby/issues/40590

Recommended answer

This is an indication that the overlay ports for vxlan are being blocked between nodes in the cluster. The ports used by vxlan are:

  • TCP and UDP port 7946 for communication among nodes
  • UDP port 4789 for overlay network traffic

Source: https://docs.docker.com/network/overlay/
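A quick way to check whether these ports are reachable between the hosts is a port probe from each node towards the others. A rough sketch using the node addresses from the report above (note that a UDP probe with nc only proves that no ICMP port-unreachable came back, so it is indicative rather than conclusive):

# from the manager towards a worker, and in the opposite direction as well
nc -zv  158.39.201.15 7946    # TCP 7946: node-to-node control traffic
nc -zvu 158.39.201.15 7946    # UDP 7946: node-to-node control traffic
nc -zvu 158.39.201.15 4789    # UDP 4789: vxlan data path

If UDP 4789 is dropped between the hosts, the ingress network's virtual IPs cannot forward traffic to tasks on other nodes, which matches the behaviour described above.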

The iptables output shown indicates that the blocking is not happening within the Linux hosts themselves (the INPUT and OUTPUT policies default to ACCEPT), so I would look at the network policies and the system used to run the VMs. E.g. VMware NSX uses these ports itself and blocked the hosted VMs as a result.
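On OpenStack the usual place for such a block is the security group attached to the instances rather than the guests themselves. A hedged sketch of opening the overlay ports with the OpenStack CLI (the group name swarm-sg is hypothetical, and the exact flags may vary with the client version):

openstack security group rule create --protocol tcp --dst-port 7946 swarm-sg   # node communication, TCP
openstack security group rule create --protocol udp --dst-port 7946 swarm-sg   # node communication, UDP
openstack security group rule create --protocol udp --dst-port 4789 swarm-sg   # vxlan overlay traffic

By default such rules apply to ingress traffic, which is the direction that matters here; the same ports must be reachable on every node of the swarm.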
