UDP packet drops by the Linux kernel
Question
I have a server that sends UDP packets via multicast and a number of clients that listen for those multicast packets. Each packet has a fixed size of 1040 bytes, and the server sends 3 GB of data in total.
My environment is as follows:
1 Gbit Ethernet
40 nodes: 1 sender node and 39 receiver nodes. All nodes have the same hardware configuration: 2 AMD CPUs, each with 2 cores at 2.6 GHz
On the client side, one thread reads the socket and puts the data into a queue. A second thread pops the data from the queue and does some lightweight processing.
During the multicast transmission I observe a packet drop rate of 30% on the receiver nodes. Looking at the netstat -su statistics, the number of packets missing in the client application equals the RcvbufErrors value in the netstat output.
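The comparison against RcvbufErrors can also be done programmatically. The following is a minimal sketch, assuming a Linux host where the same counters are exposed in /proc/net/snmp; the function names are made up for illustration. Note that these counters are system-wide rather than per socket, and the drop fraction ignores other error types, so it is only an approximation.

```python
import os


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp text as a dict of ints."""
    # The file holds two "Udp:" lines: a header with names, then the values.
    udp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Udp:")
    ]
    if len(udp_lines) < 2:
        return {}
    names, values = udp_lines[0], udp_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def rcvbuf_drop_fraction(before, after):
    """Approximate share of datagrams dropped on a full receive buffer."""
    delivered = after["InDatagrams"] - before["InDatagrams"]
    dropped = after["RcvbufErrors"] - before["RcvbufErrors"]
    total = delivered + dropped
    return dropped / total if total else 0.0


if __name__ == "__main__" and os.path.exists("/proc/net/snmp"):
    with open("/proc/net/snmp") as f:
        print(parse_udp_counters(f.read()))
```

Taking one snapshot before the transfer and one after it, then calling rcvbuf_drop_fraction on the two, gives a figure that can be compared with the loss rate the application sees.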
That means all missing packets are dropped by the OS because the socket receive buffer was full, but I do not understand why the capturing thread cannot drain the buffer in time. During the transmission, 2 of the 4 cores are at 75% utilization and the rest are idle. I am the only user of these nodes, and I would expect machines of this kind to handle 1 Gbit of bandwidth without trouble. I have already done some optimization by adding g++ compiler flags for AMD CPUs, which reduced the packet drop rate to 10%, but that is still too high in my opinion.
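One thing an unprivileged process can try is asking for a larger receive buffer on its own socket. The sketch below is an illustration in Python rather than the C++ of the original application, and the helper names are hypothetical. On Linux the kernel caps the request at net.core.rmem_max (which needs administrative rights to raise) and reports back double the granted value to account for its own bookkeeping, so the value read back is what matters. The headroom estimate is optimistic, because the kernel charges more memory per packet than the payload size alone.

```python
import socket


def request_rcvbuf(sock, requested_bytes):
    """Ask for a receive buffer size and return what the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def headroom_ms(buffer_bytes, packet_bytes, packets_per_second):
    """Upper bound on how long the reader may stall before the buffer fills."""
    packets_held = buffer_bytes // packet_bytes
    return 1000.0 * packets_held / packets_per_second


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    granted = request_rcvbuf(s, 4 * 1024 * 1024)  # 4 MiB is an arbitrary request
    print("kernel reports", granted, "bytes")
    # 100000 packets per second is an assumed rate, not a measured one.
    print("headroom: %.1f ms" % headroom_ms(granted, 1040, 100000))
    s.close()
```

If the reported size barely changes after the request, the system limit is the bottleneck and only faster draining of the socket will help.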
Of course I know that UDP is not reliable; I have my own correction protocol.
I do not have any administrative permissions, so I cannot change the system parameters.
Any hints on how I can increase the performance?
I solved this issue by using 2 threads that read the socket. The receive socket buffer still fills up sometimes, but the average drop rate is under 1%, so it is not a problem to handle.
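The structure of that fix can be sketched as follows. This is a Python illustration over loopback unicast, not the original C++ multicast client, and all names in it are hypothetical. Both reader threads block in recv() on the same socket and push into a shared queue; which thread gets a given datagram is up to the kernel, so ordering across threads is not guaranteed.

```python
import queue
import socket
import threading

PACKET_SIZE = 1040  # fixed datagram size from the question


def reader(sock, out_queue, stop_event):
    """Drain datagrams from the shared socket into the queue until stopped."""
    while not stop_event.is_set():
        try:
            data = sock.recv(PACKET_SIZE)
        except socket.timeout:
            continue  # wake up periodically to check stop_event
        out_queue.put(data)


def start_readers(sock, out_queue, stop_event, count=2):
    """Start `count` threads that all read from the same socket."""
    threads = [
        threading.Thread(target=reader, args=(sock, out_queue, stop_event))
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    return threads


def demo(num_packets=50):
    """Send datagrams over loopback and return what the readers collected."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))
    rx.settimeout(0.2)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    packets = queue.Queue()
    stop_event = threading.Event()
    threads = start_readers(rx, packets, stop_event)
    for i in range(num_packets):
        payload = i.to_bytes(4, "big").ljust(PACKET_SIZE, b"\0")
        tx.sendto(payload, rx.getsockname())
    received = []
    try:
        while len(received) < num_packets:
            received.append(packets.get(timeout=2.0))
    except queue.Empty:
        pass  # some datagrams were lost; return what arrived
    stop_event.set()
    for t in threads:
        t.join()
    tx.close()
    rx.close()
    return received


if __name__ == "__main__":
    print(len(demo()), "datagrams received")
```

In Python the interpreter lock limits how much the two threads overlap outside of recv(), so this shows the arrangement rather than the speedup; in a C++ client the two readers can run on separate cores.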
Recommended answer
Tracking down network drops on Linux can be a bit difficult, as there are many components where packet drops can happen. They can occur at the hardware level, in the network device subsystem, or in the protocol layers.
I wrote a very detailed blog post explaining how to monitor and tune each component. It is hard to summarize as a succinct answer here, since so many different components need to be monitored and tuned.
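As a rough illustration of checking more than one layer, and not a reproduction of the blog post, the sketch below parses two files that Linux exposes without special privileges: /proc/net/dev for per-interface receive counters and /proc/net/softnet_stat for per-CPU counters of the network device subsystem. The function names are hypothetical, and only the first three hexadecimal columns of softnet_stat are interpreted.

```python
import os


def parse_net_dev(text):
    """Return per-interface receive counters from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        fields = values.split()
        if len(fields) < 5:
            continue
        stats[name.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "rx_fifo": int(fields[4]),
        }
    return stats


def parse_softnet_stat(text):
    """Return one dict per CPU from /proc/net/softnet_stat text (hex fields)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        rows.append({
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),       # backlog queue was full
            "time_squeeze": int(fields[2], 16),  # polling budget ran out
        })
    return rows


if __name__ == "__main__":
    for path, parse in (("/proc/net/dev", parse_net_dev),
                        ("/proc/net/softnet_stat", parse_softnet_stat)):
        if os.path.exists(path):
            with open(path) as f:
                print(path, parse(f.read()))
```

Counters that grow here during a transfer point at the device or driver level, whereas a growing RcvbufErrors value points at the socket buffer in the protocol layer.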