Poorly-balanced socket accepts with Linux 3.2 kernel vs 2.6 kernel


Problem description



I am running a fairly large-scale Node.js 0.8.8 app using Cluster with 16 worker processes on a 16-processor box with hyperthreading (so 32 logical cores). We are finding that since moving to the Linux 3.2.0 kernel (from 2.6.32), the balancing of incoming requests between worker child processes seems to be heavily weighted to 5 or so processes, with the other 11 not doing much work at all. This may be more efficient for throughput, but seems to increase request latency and is not optimal for us because many of these are long-lived websocket connections that can start doing work at the same time.
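To make the setup concrete, here is a minimal Cluster sketch of the pattern being described, not our actual application; the port number and the per-worker counter are illustrative assumptions. Every worker accepts on the same listening socket, and logging a per-pid request count is one way to produce the kind of reqs/pid tallies shown further down.

    // cluster_sketch.js -- illustrative sketch, not the production app
    var cluster = require('cluster');
    var http = require('http');
    var os = require('os');

    if (cluster.isMaster) {
      // Fork one worker per logical core; all workers share one listening
      // socket and each of them accepts connections from it independently.
      for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
      }
      cluster.on('exit', function (worker) {
        console.log('worker ' + worker.process.pid + ' died, forking another');
        cluster.fork();
      });
    } else {
      var handled = 0; // requests served by this worker
      http.createServer(function (req, res) {
        handled++;
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('pid ' + process.pid + '\n');
      }).listen(8000); // hypothetical port

      // Periodically report this worker's share, e.g. "146 2818".
      setInterval(function () {
        console.log(handled + ' ' + process.pid);
      }, 10000);
    }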

The child processes are all accepting on a socket (using epoll), and while this problem has a fix in Node 0.9 (https://github.com/bnoordhuis/libuv/commit/be2a2176ce25d6a4190b10acd1de9fd53f7a6275), that fix does not seem to help in our tests. Is anyone aware of kernel tuning parameters or build options that could help, or are we best off moving back to the 2.6 kernel or load-balancing across worker processes with a different approach?

We boiled it down to a simple HTTP Siege test, though note that this is running with 12 procs on a 12-core box with hyperthreading (so 24 logical cores), and with 12 worker processes accepting on the socket, as opposed to our 16 procs in production.
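The exact Siege invocation isn't preserved here; something along the following lines (the concurrency, duration, and URL are assumptions) is enough to reproduce the shape of the numbers below, with each worker reporting its own request count against its pid:

    siege -b -c 100 -t 1M http://testbox:8000/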

HTTP Siege with Node 0.9.3 on Debian Squeeze with 2.6.32 kernel on bare metal:

reqs pid
146  2818
139  2820
211  2821
306  2823
129  2825
166  2827
138  2829
134  2831
227  2833
134  2835
129  2837
138  2838

Same everything except with the 3.2.0 kernel:

reqs pid
99   3207
186  3209
42   3210
131  3212
34   3214
53   3216
39   3218
54   3220
33   3222
931  3224
345  3226
312  3228

Solution

Don't depend on the OS's socket multiple accept to balance load across web server processes.

The Linux kernel's behavior here differs from version to version; we saw particularly imbalanced behavior with the 3.2 kernel, and it appeared to be somewhat more balanced in later versions, e.g. 3.6.

We were operating under the assumption that there should be a way to make Linux do something like round-robin here, but we ran into a variety of issues, including:

  • Linux kernel 2.6 showed something like round-robin behavior on bare metal (imbalances were about 3-to-1), Linux kernel 3.2 did not (10-to-1 imbalances), and kernel 3.6.10 seemed okay again. We did not attempt to bisect to the actual change.
  • Regardless of the kernel version or build options used, the behavior we saw on a 32-logical-core HVM instance on Amazon Web Services was severely weighted toward a single process; there may be issues with Xen socket accept: https://serverfault.com/questions/272483/why-is-tcp-accept-performance-so-bad-under-xen

You can see our tests in great detail on the GitHub issue we were using to correspond with the excellent Node.js team, starting about here: https://github.com/joyent/node/issues/3241#issuecomment-11145233

That conversation ends with the Node.js team indicating that they are seriously considering implementing explicit round-robin in Cluster, and starting an issue for that: https://github.com/joyent/node/issues/4435. Meanwhile the Trello team (that's us) went to our fallback plan: a local HAProxy process on each server machine proxying across 16 ports, with a 2-worker-process Cluster instance running on each port (for fast failover at the accept level in case of a process crash or hang). That plan is working beautifully, with greatly reduced variation in request latency and a lower average latency as well.
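For anyone curious about the shape of that fallback, a rough haproxy.cfg along these lines captures the idea; the port numbers, names, health-check path, and timeouts are assumptions rather than our exact production configuration, and only the first two of the 16 server lines are shown:

    defaults
        mode http
        timeout connect 5s
        timeout client  60s
        timeout server  60s

    frontend public
        bind *:80
        default_backend node_clusters

    backend node_clusters
        balance roundrobin           # explicit round-robin across the local Cluster instances
        option httpchk GET /health   # hypothetical health-check path, for fast failover
        server node01 127.0.0.1:8001 check
        server node02 127.0.0.1:8002 check
        # ...one server line per port, through 127.0.0.1:8016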

There is a lot more to be said here, and I did NOT take the step of mailing the Linux kernel mailing list, as it was unclear if this was really a Xen or a Linux kernel issue, or really just an incorrect expectation of multiple accept behavior on our part.

I'd love to see an answer from an expert on multiple accept, but we're going back to what we can build using components that we understand better. If anyone posts a better answer, I would be delighted to accept it instead of mine.
