Docker services stop communicating after some time


Problem description

I have a total of 6 containers running in Docker Swarm: Kafka+Zookeeper, MongoDB, A, B, C, and Interface. Interface is the main access point from the public - it is the only container that publishes a port (5683). The Interface container connects to A, B, and C during startup. I am using a docker-compose file + docker stack deploy, and each service has a name that Interface uses as the hostname. Everything starts successfully and works fine. After some time (20 minutes, 1 hour, ...) I am no longer able to make requests to Interface. Interface still receives my requests, but the application has lost its connection to service A, B, C, or all of them. If I restart Interface, it is able to reconnect to services A, B, and C.
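For illustration, a minimal docker-compose sketch of such a stack might look like the following; the image names and the service names for A/B/C are assumptions (only the published port 5683 comes from the question), and Kafka+Zookeeper and MongoDB are omitted for brevity:

```yaml
version: "3"
services:
  interface:
    image: myorg/interface:latest     # hypothetical image name
    ports:
      - "5683:5683"                   # the only published port in the stack
  service-a:
    image: myorg/service-a:latest     # reached by Interface via hostname "service-a"
  service-b:
    image: myorg/service-b:latest
  service-c:
    image: myorg/service-c:latest
```

Deployed with `docker stack deploy -c docker-compose.yml <stack>`, each service name resolves as a DNS name on the overlay network, which is how Interface reaches A, B, and C.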

At first I thought it was a problem in the application, so I exposed 2 new ports on each service (Interface, A, B, C) and attached a profiler and debugger to them. The applications were running properly: no leaks, no blocked threads, everything working normally and waiting for connections. The debugger showed me that when I made a request to Interface and Interface tried to call service A, a "Connection reset by peer" exception was thrown.

During this debugging I found out something interesting. I had attached the debugger to Interface when the services started, and the debugger itself was also disconnected after some time. Moreover, I was not able to reconnect it until I made a request to the container and thus to the application; the reported problem was a failed handshake.

Another interesting finding was that I was not able to make requests to Interface either. So I used Wireshark to see what was going on: the TCP handshake (SYN, SYN/ACK) completed fine, but when data was then posted, Interface responded with FIN, ACK. I assume the same thing happens when Interface tries to call service A and the connection gets FINed. Interface, A, B, and C share the same codebase as far as the Netty server is concerned.

Finally, I don't think it is an application issue. Why? I tried deploying the containers not as Swarm services: I ran each container separately, published the ports of each one, and set the service endpoints to localhost (i.e. no overlay network). And it works - the containers run without problems. I also didn't mention at the beginning that the Java applications (Interface, A, B, C) run without problems as standalone applications, i.e. not in Docker.
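A sketch of that experiment (all image names, the A/B/C ports, and the environment variables are assumptions; the point is that the default bridge driver is used instead of the overlay):

```sh
# Each service runs as a plain container with its port published to the host
docker run -d --name service-a -p 7001:7001 myorg/service-a:latest
docker run -d --name service-b -p 7002:7002 myorg/service-b:latest
docker run -d --name service-c -p 7003:7003 myorg/service-c:latest

# Interface is pointed at localhost instead of Swarm service names;
# host networking is one way to make "localhost" reach the published ports
docker run -d --name interface --network host \
  -e SERVICE_A_HOST=localhost -e SERVICE_A_PORT=7001 \
  myorg/interface:latest
```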

Could you please help me figure out what the issue could be? Why does Docker close the sockets when an overlay network is used?

I am using the newest Docker; I also tried older versions.

Answer

Finally, I was able to solve the problem.

What was happening, one more time: Interface opens permanent TCP connections to A, B, and C. When you run services A, B, and C as standalone Java applications, everything works. When we dockerized them and ran them in Swarm, it worked for only a few minutes. The strange part was that the connection between Interface and another service was interrupted at the very moment a request was made from the client to Interface.

After many, many unsuccessful tests, and after debugging each container, I tried running each Docker container separately, with mapped ports, specifying localhost as the endpoint (each container exposed its ports and Interface connected to localhost). A funny thing happened: it worked. When you run containers like this, a different network driver is used for the containers - the bridge driver. If you run them in Swarm, the overlay network driver is used.
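The driver difference is easy to confirm; a sketch with a hypothetical stack name and abbreviated output:

```sh
docker network ls
# NETWORK ID     NAME              DRIVER    SCOPE
# ...            bridge            bridge    local    <- standalone `docker run`
# ...            mystack_default   overlay   swarm    <- `docker stack deploy`

docker network inspect mystack_default --format '{{ .Driver }}'
# overlay
```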

So it had to be something in the Docker network, not in the application itself. The next step was to capture tcpdump output from each container after a couple of minutes, at the point where it should stop working. It was very interesting.
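For reference, a sketch of how such a capture can be taken (assuming tcpdump is available inside the image and eth0 is the overlay interface; the container name and port are hypothetical):

```sh
# Capture A's application traffic from inside the running container
docker exec -it service-a tcpdump -i eth0 -nn tcp port 7001
```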

  • Client -> Interface (OK, request accepted)
  • Interface -> A (request forwarded because it belongs to A)
    • Interface -> A [POST]
    • A -> Interface [RESET]

A was resetting the open TCP connection after a couple of minutes without communication. Why?

Docker uses IP Virtual Server (IPVS), and IPVS maintains its own connection table. The default timeout for CLOSE_WAIT connections in the IPVS table is 60 seconds. Hence, when the server sends something after 60 seconds, the IPVS connection is no longer available, the packet looks invalid for a new TCP session, and it gets an RST. On the client side, the connection remains in FIN_WAIT2 state forever, because the application still has the socket open and the kernel's fin_wait timer only kicks in for orphaned TCP sockets.
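If you want to check this on a Swarm node yourself, the IPVS state lives in a Docker-managed network namespace. The following is a sketch, not something from the original post: the namespace name and values vary per host, and the sysctl is a commonly suggested kernel-level mitigation rather than the fix used here:

```sh
# Inspect IPVS timeouts from the namespace that holds the load-balancing
# rules (pick the right one under /var/run/docker/netns/)
nsenter --net=/var/run/docker/netns/<namespace> ipvsadm -l --timeout

# Alternative mitigation: have the kernel send TCP keepalives before the
# 60 s IPVS entry described above expires (the application must enable
# SO_KEEPALIVE on its sockets for this to have any effect)
sysctl -w net.ipv4.tcp_keepalive_time=30
```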

This is what I have read about it and how I understand it. I am not sure whether my explanation of the problem is correct, but based on these assumptions I implemented a ping-pong between Interface and the A, B, C services that fires whenever a connection has been idle for a little less than 60 seconds. And it is working.
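The post does not show the keep-alive code. With Netty (which the services use), a minimal sketch of such a ping could hook into IdleStateHandler, sending a small message whenever a connection has been idle for 30 seconds - safely below the 60-second timeout. The "PING" payload is an assumption; any no-op message the protocol between Interface and A/B/C tolerates would do:

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;
import java.nio.charset.StandardCharsets;

public class PingPongHandler extends ChannelDuplexHandler {
    // Wire into the client pipeline on the Interface side, e.g.:
    //   pipeline.addLast(new IdleStateHandler(0, 0, 30)); // all-idle after 30 s
    //   pipeline.addLast(new PingPongHandler());

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.ALL_IDLE) {
            // No traffic for 30 s: send a ping so IPVS keeps the entry alive
            ctx.writeAndFlush(Unpooled.copiedBuffer("PING", StandardCharsets.UTF_8));
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}
```

The receiving side (A, B, C) can simply discard or echo the ping; the goal is only to generate traffic before IPVS expires the connection entry.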

