Node.js Server Timeout Problems (EC2 + Express + PM2)


Problem Description


I'm relatively new to running production node.js apps and I've recently been having problems with my server timing out.

Basically after a certain amount of usage & time my node.js app stops responding to requests. I don't even see routes being fired on my console anymore - it's like the whole thing just comes to a halt and the HTTP calls from my client (iPhone running AFNetworking) don't reach the server anymore. But if I restart my node.js app server everything starts working again, until things inevitably stop again. The app never crashes, it just stops responding to requests.

I'm not getting any errors, and I've made sure to handle and log all DB connection errors so I'm not sure where to start. I thought it might have something to do with memory leaks so I installed node-memwatch and set up a listener for memory leaks but that doesn't get called before my server stops responding to requests.

Any clue as to what might be happening and how I can solve this problem?

Here's my stack:

  • Node.js on AWS EC2 Micro Instance (using Express 4.0 + PM2)
  • Database on AWS RDS volume running MySQL (using node-mysql)
  • Sessions stored w/ Redis on same EC2 instance as the node.js app
  • Clients are iPhones accessing the server via AFNetworking

Once again no errors are firing with any of the modules mentioned above.

Solution

First of all you need to be a bit more specific about timeouts.

  • TCP timeouts: TCP divides a message into packets which are sent one by one. The receiver needs to acknowledge having received each packet. If the receiver does not acknowledge having received the packet within a certain period of time, a TCP retransmission occurs, which is sending the same packet again. If this happens a couple more times, the sender gives up and kills the connection.

  • HTTP timeout: An HTTP client like a browser, or your server while acting as a client (e.g: sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it will disconnect and call it a timeout.

Now, there are many, many possible causes for this... from more trivial to less trivial:

  • Wrong Content-Length calculation: If you send a request with a Content-Length: 20 header, that means "I am going to send you 20 bytes". If you send 19, the other end will wait for the remaining 1. If that takes too long... timeout.

  • Not enough infrastructure: Maybe you should assign more machines to your application. If (total load / # of CPU cores) is over 1, or your memory usage is high, your system may be over capacity. However keep reading...

  • Silent exception: An error was thrown but not logged anywhere. The request never finished processing, leading to the next item.

  • Resource leaks: Every request needs to be handled to completion. If you don't do this, the connection will remain open. In addition, the IncomingMessage object (aka: usually called req in express code) will remain referenced by other objects (e.g: express itself). Each one of those objects can use a lot of memory.

  • Node event loop starvation: I will get to that at the end.
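A concrete illustration of the Content-Length item above: the header counts bytes, not characters, so multi-byte UTF-8 strings are an easy way to mis-count. A minimal sketch (the string is an arbitrary example):

```javascript
// 'h\u00e9llo' is 5 characters, but '\u00e9' (é) is 2 bytes in UTF-8.
const body = 'h\u00e9llo';

// WRONG: .length counts characters, not the bytes that go on the wire.
// Either direction of mismatch breaks HTTP framing: a header that is too
// small truncates the body, one that is too large leaves the receiver
// waiting for bytes that never arrive... timeout.
const wrong = body.length;                      // 5
const right = Buffer.byteLength(body, 'utf8');  // 6

console.log({ wrong, right }); // → { wrong: 5, right: 6 }
```

With node's http module, passing the whole body to a single res.end(body) call typically lets node compute the header itself; use Buffer.byteLength whenever you must set it by hand.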


For memory leaks, the symptoms would be: the node process would be using an increasing amount of memory.
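A cheap way to watch for this symptom from inside the process is process.memoryUsage(). The sketch below only demonstrates that a retained allocation shows up in heapUsed; the array is an arbitrary stand-in for a real leak:

```javascript
const before = process.memoryUsage().heapUsed;

// Simulate a leak: a large structure that stays referenced forever.
const retained = new Array(1e6).fill(0);

const after = process.memoryUsage().heapUsed;
console.log(`heapUsed grew by ~${((after - before) / 1048576).toFixed(1)} MB`);

// In a real app, log process.memoryUsage() on an interval and alert
// when rss / heapUsed climb without ever coming back down.
```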

To make things worse, if available memory is low and your server is misconfigured to use swapping, Linux will start moving memory to disk (swapping), which is very I/O and CPU intensive. Servers should not have swapping enabled.

cat /proc/sys/vm/swappiness

will return you the level of swappiness configured in your system (goes from 0 to 100). You can modify it in a persistent way via /etc/sysctl.conf (requires restart) or in a volatile way using: sysctl vm.swappiness=10
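The persistent route is a one-line setting (10 here is an arbitrary illustrative value; pick what suits your workload):

```
# /etc/sysctl.conf
vm.swappiness=10
```

Running sudo sysctl -p applies the file immediately, while sysctl vm.swappiness=10 changes the live value without touching the file.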

Once you've established you have a memory leak, you need to get a core dump and download it for analysis. A way to do that can be found in this other Stackoverflow response: Tools to analyze core dump from Node.js

For connection leaks (you leaked a connection by not handling a request to completion), you would see an increasing number of established connections to your server. netstat -a -p tcp | grep ESTABLISHED | wc -l can be used to count established connections.

Now, event loop starvation is the worst problem. If your code is short-lived, node works very well. But if you do CPU-intensive stuff and have a function that keeps the CPU busy for an excessive amount of time... like 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code taking 50 ms), operations being handled by the event loop, such as processing HTTP requests, start falling behind and eventually time out.
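Starvation is easy to reproduce: a synchronous busy-loop freezes everything, timers included. A minimal sketch (the 50 ms figure mirrors the example above):

```javascript
let late = 0;

function blockFor(ms) {
  // Solid, blocking, synchronous CPU time -- nothing else runs meanwhile.
  const start = Date.now();
  while (Date.now() - start < ms) { /* spin */ }
  return Date.now() - start;
}

// This timer is due in 10 ms, but it cannot fire while the loop is blocked.
const scheduled = Date.now();
setTimeout(() => {
  late = Date.now() - scheduled;
  console.log(`10 ms timer actually fired after ${late} ms`);
}, 10);

const blocked = blockFor(50); // starve the event loop for ~50 ms
```

In a real server, the same 50 ms spent in a huge JSON.parse, synchronous crypto, or a catastrophic regex delays every pending request by that amount, and under load the delays compound until requests time out.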

The way to find a CPU bottleneck is using a performance profiler. nodegrind/qcachegrind are my preferred profiling tools, but others prefer flamegraphs and such. However, it can be hard to run a profiler in production. Just take a development server and slam it with requests. aka: a load test. There are many tools for this.


Finally, another way to debug the problem is:

env NODE_DEBUG=tls,net node <...arguments for your app>

node has optional debug statements that are enabled through the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net will make node emit debugging information for the tls and net modules... so basically everything being sent or received. If there's a timeout you will see where it's coming from.

Source: Experience of maintaining large deployments of node services for years.
