What can cause TCP/IP to drop packets without dropping the connection?

Question

I have a web-based application and a client, both written in Java. For what it's worth, the client and server are both on Windows. The client issues HTTP GETs via Apache HttpClient. The server blocks for up to a minute and if no messages have arrived for the client within that minute, the server returns HTTP 204 No Content. Otherwise, as soon as a message is ready for the client, it is returned with the body of an HTTP 200 OK.

Here is what has me puzzled: Intermittently for a specific subset of clients -- always clients with demonstrably flaky network connections -- the client issues a GET, the server receives and processes the GET, but the client sits forever. Enabling debugging logs for the client, I see that HttpClient is still waiting for the very first line of the response.

There is no Exception thrown on the server, at least nothing logged anywhere, not by Tomcat, not by my webapp. According to debugging logs, there is every sign that the server successfully responded to the client. However, the client shows no sign of having received anything. The client hangs indefinitely in HttpClient.executeMethod. This becomes obvious after the session times out and the client takes action that causes another Thread to issue an HTTP POST. Of course, the POST fails because the session has expired. In some cases, hours have elapsed between the session expiring and the client issuing a POST and discovering this fact. For this entire time, executeMethod is still waiting for the HTTP response line.

When I use Wireshark to see what is really going on at the wire level, this failure does not occur. That is, the failure normally occurs within a few hours for specific clients, but when Wireshark is running at both ends, these same clients will run overnight, 14 hours, without a failure.

Has anyone else encountered something like this? What in the world can cause it? I thought that TCP/IP guaranteed packet delivery even across short term network glitches. If I set an SO_TIMEOUT and immediately retry the request upon timeout, the retry always succeeds. (Of course, I first abort the timed-out request and release the connection to ensure that a new socket will be used.)
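A minimal sketch of the timeout-and-retry workaround described above, written against plain `java.net.Socket` rather than Apache HttpClient so that it is self-contained. The class name `TimedPoll`, the method `pollWithRetry`, and its parameters are hypothetical and chosen for illustration; this is not the code from the application in question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

final class TimedPoll {
    /**
     * Sends one GET per attempt, each on a fresh socket, and returns the
     * HTTP status line. If nothing arrives within readTimeoutMs, the socket
     * is closed and the request is retried, up to maxAttempts times.
     */
    static String pollWithRetry(String host, int port, String path,
                                int readTimeoutMs, int maxAttempts) throws IOException {
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // try-with-resources closes the socket, so a retry never reuses it.
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), readTimeoutMs);
                socket.setSoTimeout(readTimeoutMs); // SO_TIMEOUT: bounds each blocking read

                String request = "GET " + path + " HTTP/1.1\r\n"
                        + "Host: " + host + "\r\n"
                        + "Connection: close\r\n\r\n";
                OutputStream out = socket.getOutputStream();
                out.write(request.getBytes(StandardCharsets.US_ASCII));
                out.flush();

                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
                String statusLine = in.readLine(); // throws SocketTimeoutException if the peer stays silent
                if (statusLine == null) {
                    throw new IOException("connection closed before a response arrived");
                }
                return statusLine;
            } catch (SocketTimeoutException e) {
                last = e; // remember the timeout and try again on a new socket
            }
        }
        throw new IOException("no response after " + maxAttempts + " attempts", last);
    }
}
```

Because the server blocks for up to a minute before answering, the read timeout has to be longer than that window (for example 75000 ms, an arbitrary value), otherwise every idle poll would be treated as a failure.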

Thoughts? Ideas? Is there some TCP/IP setting available to Java or a registry setting in Windows that will enable more aggressive TCP/IP retries on lost packets?

Recommended answer

Are you absolutely sure that the server has successfully sent the response to the clients that seem to fail? By this I mean the server has sent the response and the client has ACKed that response back to the server. You should see this using Wireshark on the server side. If you are sure this has occurred on the server side and the client still does not see anything, you need to look further up the chain from the server. Are there any proxy/reverse proxy servers or NAT involved?

TCP is considered a reliable transport, but it does not guarantee delivery. The TCP/IP stack of your OS will try pretty hard to get packets to the other end using TCP retransmissions. You should see these in Wireshark on the server side if this is happening. If you see excessive TCP retransmissions, it is usually a network infrastructure issue - i.e. bad or misconfigured hardware/interfaces. TCP retransmissions work well for short network interruptions, but perform poorly on a network with a longer interruption. This is because the TCP/IP stack will only send a retransmission after a timer expires, and this timer typically doubles after each unsuccessful retransmission. This is by design, to avoid flooding an already problematic network with retransmissions. As you might imagine, this usually causes applications all sorts of timeout issues.
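To make the doubling timer concrete, here is a small sketch that computes when each retransmission would go out. The initial timeout of 3000 ms and the count of 5 retransmissions are assumptions picked for illustration; real stacks derive the timer from measured round-trip times and apply their own limits. The names `RetransmitSchedule` and `retransmitTimesMs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RetransmitSchedule {
    /**
     * Returns, for each retransmission, the elapsed time in milliseconds
     * since the original send, assuming the timer doubles after every
     * unsuccessful attempt.
     */
    static List<Long> retransmitTimesMs(long initialTimeoutMs, int retransmissions) {
        List<Long> times = new ArrayList<>();
        long timeout = initialTimeoutMs;
        long elapsed = 0;
        for (int i = 0; i < retransmissions; i++) {
            elapsed += timeout;  // wait for the current timer to expire
            times.add(elapsed);  // the retransmission goes out at this point
            timeout *= 2;        // back off: double the timer for the next attempt
        }
        return times;
    }

    public static void main(String[] args) {
        // Assumed values: 3 s initial timeout, 5 retransmissions.
        System.out.println(retransmitTimesMs(3000, 5));
        // prints [3000, 9000, 21000, 45000, 93000]
    }
}
```

Under these assumed values the fifth retransmission is sent 93 seconds after the original segment, which shows how an outage of a minute or so can leave an application waiting far longer than it expects.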

Depending on your network topology, you may also need to place probes (Wireshark or tcpdump) at other intermediate locations in the network. It will probably take some time to find out where the packets have gone.

If I were you I would keep monitoring with Wireshark on all ends until the problem recurs. It most likely will. But it sounds like what you will ultimately find is what you already mentioned - flaky hardware. If fixing the flaky hardware is out of the question, you may need to just build in extra application-level timeouts and retries to attempt to deal with the issue in software. It sounds like you have already started going down this path.
