No route to host from some Kubernetes containers to other containers in same cluster


Problem description

This is a Kubespray deployment using Calico. All the defaults were left as-is, except for the fact that there is a proxy. Kubespray ran to the end without issues.

Access to Kubernetes services started failing and after investigation, there was no route to host to the coredns service. Accessing a K8S service by IP worked. Everything else seems to be correct, so I am left with a cluster that works, but without DNS.

Here is some background information. Starting up a busybox container:

# nslookup kubernetes.default
Server:     169.254.25.10
Address:    169.254.25.10:53

** server can't find kubernetes.default: NXDOMAIN

*** Can't find kubernetes.default: No answer
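
For context, 169.254.25.10 is the nodelocaldns address that kubespray configures by default. A quick way to confirm which resolver the pod is actually using (my own suggested check, run inside the same busybox container) is:

# cat /etc/resolv.conf

The nameserver there should match the address shown by nslookup above; if it does not, the problem is in the pod's DNS configuration rather than in routing.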

Now the output when pointing nslookup explicitly at the CoreDNS service IP:

# nslookup kubernetes.default 10.233.0.3
;; connection timed out; no servers could be reached

Notice that telnet to the Kubernetes API works:

# telnet 10.233.0.1 443
Connected to 10.233.0.1

kube-proxy logs: 10.233.0.3 is the service IP for coredns. The last line looks concerning, even though it is INFO.

$ kubectl logs kube-proxy-45v8n -nkube-system
I1114 14:19:29.657685       1 node.go:135] Successfully retrieved node IP: X.59.172.20
I1114 14:19:29.657769       1 server_others.go:176] Using ipvs Proxier.
I1114 14:19:29.664959       1 server.go:529] Version: v1.16.0
I1114 14:19:29.665427       1 conntrack.go:52] Setting nf_conntrack_max to 262144
I1114 14:19:29.669508       1 config.go:313] Starting service config controller
I1114 14:19:29.669566       1 shared_informer.go:197] Waiting for caches to sync for service config
I1114 14:19:29.669602       1 config.go:131] Starting endpoints config controller
I1114 14:19:29.669612       1 shared_informer.go:197] Waiting for caches to sync for endpoints config
I1114 14:19:29.769705       1 shared_informer.go:204] Caches are synced for service config 
I1114 14:19:29.769756       1 shared_informer.go:204] Caches are synced for endpoints config 
I1114 14:21:29.666256       1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.124.23:53
I1114 14:21:29.666380       1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.122.11:53
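
(Not part of the original output, but worth cross-checking here: the backends that kube-proxy programmed should match the CoreDNS Service endpoints. The service name coredns below is kubespray's default and may differ in other setups.)

$ kubectl -n kube-system get svc coredns -o wide
$ kubectl -n kube-system get endpoints coredns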

All pods are running without crashes or restarts, and otherwise services behave correctly.

IPVS looks correct. The CoreDNS service is defined there:

# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.233.0.1:443 rr
  -> x.59.172.19:6443           Masq    1      0          0         
  -> x.59.172.20:6443           Masq    1      1          0         
TCP  10.233.0.3:53 rr
  -> 10.233.122.12:53             Masq    1      0          0         
  -> 10.233.124.24:53             Masq    1      0          0         
TCP  10.233.0.3:9153 rr
  -> 10.233.122.12:9153           Masq    1      0          0         
  -> 10.233.124.24:9153           Masq    1      0          0         
TCP  10.233.51.168:3306 rr
  -> x.59.172.23:6446           Masq    1      0          0         
TCP  10.233.53.155:44134 rr
  -> 10.233.89.20:44134           Masq    1      0          0         
UDP  10.233.0.3:53 rr
  -> 10.233.122.12:53             Masq    1      0          314       
  -> 10.233.124.24:53             Masq    1      0          312
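
A possible next step at this level (my suggestion, not from the original post) is to look at the IPVS packet counters and connection table, which show whether DNS requests are reaching the virtual server and being forwarded to the real servers at all:

# ipvsadm -Ln --stats
# ipvsadm -Lnc | grep 10.233.0.3
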

Host routing also looks correct.

# ip r
default via x.59.172.17 dev ens3 proto dhcp src x.59.172.22 metric 100 
10.233.87.0/24 via x.59.172.21 dev tunl0 proto bird onlink 
blackhole 10.233.89.0/24 proto bird 
10.233.89.20 dev calib88cf6925c2 scope link 
10.233.89.21 dev califdffa38ed52 scope link 
10.233.122.0/24 via x.59.172.19 dev tunl0 proto bird onlink 
10.233.124.0/24 via x.59.172.20 dev tunl0 proto bird onlink 
x.59.172.16/28 dev ens3 proto kernel scope link src x.59.172.22 
x.59.172.17 dev ens3 proto dhcp scope link src x.59.172.22 metric 100 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
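
Remote pod subnets are reached through tunl0 (Calico IPIP), so a quick way to test the overlay itself (my own suggested check; the pod IP and interface name are taken from the outputs above) is to ping a pod on another node from this host while capturing protocol 4 traffic on the underlying interface:

# ping -c 2 10.233.122.12
# tcpdump -ni ens3 ip proto 4

If the ping fails and no IPIP packets arrive on the other node, the encapsulated traffic is being dropped somewhere between the hosts.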

I have redeployed this same cluster in separate environments with flannel, and with calico using iptables instead of ipvs. I have also temporarily disabled the docker HTTP proxy after deployment. None of this makes any difference.

Also:
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
(They do not overlap.)
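
(For reference, these values live in the kubespray inventory group_vars; the exact file layout depends on the kubespray version, so the command below is only a hint for locating them.)

$ grep -RE "kube_service_addresses|kube_pods_subnet" inventory/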

What is the next step in debugging this issue?

Recommended answer

I highly recommend that you avoid using the latest busybox image to troubleshoot DNS. There are a few reported issues with nslookup on versions newer than 1.28.

v 1.28.4

user@node1:~$ kubectl exec -ti busybox busybox | head -1
BusyBox v1.28.4 (2018-05-22 17:00:17 UTC) multi-call binary.

user@node1:~$ kubectl exec -ti busybox -- nslookup kubernetes.default 
Server:    169.254.25.10
Address 1: 169.254.25.10

Name:      kubernetes.default
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

v 1.31.1

user@node1:~$ kubectl exec -ti busyboxlatest busybox | head -1
BusyBox v1.31.1 (2019-10-28 18:40:01 UTC) multi-call binary.

user@node1:~$ kubectl exec -ti busyboxlatest -- nslookup kubernetes.default 
Server:     169.254.25.10
Address:    169.254.25.10:53

** server can't find kubernetes.default: NXDOMAIN

*** Can't find kubernetes.default: No answer

command terminated with exit code 1
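
If you want to reproduce this comparison, the two test pods can be created along these lines (the pod names and the sleep duration are only examples):

user@node1:~$ kubectl run busybox --image=busybox:1.28.4 --restart=Never -- sleep 3600
user@node1:~$ kubectl run busyboxlatest --image=busybox:latest --restart=Never -- sleep 3600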

Going deeper and exploring more possibilities, I've reproduced your problem on GCP and after some digging I was able to figure out what is causing this communication problem.

GCE (Google Compute Engine) blocks traffic between hosts by default; we have to allow Calico traffic to flow between containers on different hosts.

According to the Calico documentation, you can do this by creating a firewall rule that allows this traffic:

gcloud compute firewall-rules create calico-ipip --allow 4 --network "default" --source-ranges "10.128.0.0/9"

You can verify the rule with this command:

gcloud compute firewall-rules list

This is no longer present in the most recent Calico documentation, but it is still true and necessary.
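
The rule above allows protocol 4 (IP-in-IP), which is what Calico's default IPIP mode uses. If you want to confirm that your IP pools really are in IPIP mode (this assumes calicoctl is installed, which is not guaranteed on a kubespray cluster), you can check:

calicoctl get ippool -o wide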

Before creating the firewall rule:

user@node1:~$ kubectl exec -ti busybox2 -- nslookup kubernetes.default 
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1

After creating the firewall rule:

user@node1:~$ kubectl exec -ti busybox2 -- nslookup kubernetes.default 
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

It doesn't matter whether you bootstrap your cluster with kubespray or kubeadm; this problem will happen because Calico needs to communicate between nodes and GCE blocks that traffic by default.
