What could cause so many TIME_WAIT connections to be open?

Question

So, I have application A on one server which sends 710 HTTP POST messages per second to application B on another server, which is listening on a single port. The connections are not keep-alive; they are closed.

After a few minutes, application A reports that it can't open new connections to application B.

I am running netstat continuously on both machines, and see that a huge number of TIME_WAIT connections are open on each. Virtually all connections showing are in TIME_WAIT. From reading online, it seems that a connection stays in this state for 30 seconds after each side closes it (on our machines, the value of /proc/sys/net/ipv4/tcp_fin_timeout is 30).

I have a script running on each machine that's continuously doing:

netstat -na | grep 5774 | wc -l

and:

netstat -na | grep 5774 | grep "TIME_WAIT" | wc -l
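A plain `grep 5774` also matches lines where 5774 appears somewhere other than as the port (for example, port 15774). The following is a minimal sketch of a stricter count that also breaks the total down by state. It assumes the Linux `netstat -na` column layout (local address in column 4, foreign address in column 5, state in column 6); `count_states` is a hypothetical helper name, not something from the original post.

```shell
# Summarise TCP connection states for one port from `netstat -na` output on stdin.
# Only lines whose local ($4) or foreign ($5) address ends in ":<port>" are counted,
# so a line that merely contains the digits elsewhere is ignored.
count_states() {
  port=$1
  awk -v port="$port" '
    $1 ~ /^tcp/ && ($4 ~ (":" port "$") || $5 ~ (":" port "$")) { count[$6]++ }
    END { for (state in count) print state, count[state] }
  ' | sort
}
```

Usage would be `netstat -na | count_states 5774`, which prints one line per state with its count.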

The value of each, on each machine, seems to get to around 28,000 before application A reports that it can't open new connections to application B.

I've read that this file: /proc/sys/net/ipv4/ip_local_port_range provides the total number of connections that can be open at once:

$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000

61000 - 32768 = 28232, which is right in line with the approximately 28,000 TIME_WAITs I am seeing.
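The subtraction can be scripted so it follows whatever range a machine is configured with. This is only a sketch: `port_range_size` is a made-up helper name. As far as I understand, the range limits outgoing connections from one source address to a single destination address and port, which is the situation described here, rather than all connections on the machine.

```shell
# Size of an ephemeral port range given as "low high",
# the format of /proc/sys/net/ipv4/ip_local_port_range.
port_range_size() {
  low=$1
  high=$2
  # Same subtraction as in the post; counting both endpoints would give one more.
  echo $(( high - low ))
}

# With the values shown above:
port_range_size 32768 61000   # prints 28232
```

On a Linux host the live values could be passed in with `port_range_size $(cat /proc/sys/net/ipv4/ip_local_port_range)`.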

My question is: how is it possible to have so many connections in TIME_WAIT?

It seems that at 710 connections per second being closed, I should see approximately 710 * 30 seconds = 21300 of these at a given time. I suppose that just because there are 710 being opened per second doesn't mean that there are 710 being closed per second...

The only other thing I can think of is a slow OS getting around to closing the connections.

Recommended answer

TCP's TIME_WAIT indicates that the local endpoint (this side) has closed the connection. The connection is being kept around so that any delayed packets can be matched to the connection and handled appropriately. The connections will be removed when they time out, within four minutes.
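How long a socket stays in TIME_WAIT matters a great deal for the numbers in the question. A rough steady-state estimate is the close rate multiplied by the time each socket stays in the state. The sketch below compares the question's 30-second assumption with a 60-second hold. To my knowledge Linux keeps sockets in TIME_WAIT for a fixed 60 seconds, and tcp_fin_timeout governs FIN_WAIT_2 instead, but treat that as an assumption to check on your kernel. `time_wait_estimate` is a hypothetical helper name.

```shell
# Rough steady-state count of TIME_WAIT sockets:
# connections closed per second * seconds each one stays in TIME_WAIT.
time_wait_estimate() {
  rate=$1
  hold_seconds=$2
  echo $(( rate * hold_seconds ))
}

time_wait_estimate 710 30   # the question's assumption: prints 21300
time_wait_estimate 710 60   # assuming a 60-second TIME_WAIT: prints 42600
```

Under the 60-second assumption the estimate (42,600) is larger than the 28,232 ports in the local range, which would be consistent with the count levelling off at about 28,000 and new connections then failing.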

Assuming that all of those connections were valid, then everything is working correctly. You can eliminate the TIME_WAIT state by having the remote end close the connection or you can modify system parameters to increase recycling (though it can be dangerous to do so).

Vincent Bernat has an excellent article on TIME_WAIT and how to deal with it:

The Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle does:

Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.

Its sibling, net.ipv4.tcp_tw_reuse is a little bit more documented but the language is about the same:

Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.

The mere result of this lack of documentation is that we find numerous tuning guides advising to set both these settings to 1 to reduce the number of entries in the TIME-WAIT state. However, as stated by tcp(7) manual page, the net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won’t handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you:

Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).
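Before changing any of these settings it helps to see what a machine currently has. The following read-only sketch prints the settings discussed above. `show_tw_settings` is a hypothetical helper, and the optional directory argument exists only so it can be pointed somewhere other than /proc. As far as I know, tcp_tw_recycle was removed in Linux 4.12, so it may be reported as not present on newer kernels.

```shell
# Print the TIME_WAIT-related settings mentioned in this thread without changing them.
# An optional first argument overrides the directory to read from.
show_tw_settings() {
  base=${1:-/proc/sys/net/ipv4}
  for name in tcp_tw_reuse tcp_tw_recycle tcp_fin_timeout ip_local_port_range; do
    if [ -r "$base/$name" ]; then
      printf '%s = %s\n' "$name" "$(cat "$base/$name")"
    else
      printf '%s = (not present)\n' "$name"
    fi
  done
}
```

Running `show_tw_settings` with no argument reads the live values on a Linux host.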
