Node.js Server Timeout Problems (EC2 + Express + PM2)

Question

I'm relatively new to running production node.js apps, and I've recently been having problems with my server timing out.

Basically, after a certain amount of usage and time, my node.js app stops responding to requests. I don't even see routes being fired on my console anymore. It's as if the whole thing comes to a halt, and the HTTP calls from my client (iPhone running AFNetworking) no longer reach the server. If I restart my node.js app server, everything starts working again, until things inevitably stop again. The app never crashes; it just stops responding to requests.

I'm not getting any errors, and I've made sure to handle and log all DB connection errors, so I'm not sure where to start. I thought it might have something to do with memory leaks, so I installed node-memwatch and set up a listener for memory leaks, but it never gets called before my server stops responding to requests.

Any clue as to what might be happening and how I can solve this problem?

Here's my stack:

  • Node.js on an AWS EC2 micro instance (using Express 4.0 + PM2)
  • Database on an AWS RDS volume running MySQL (using node-mysql)
  • Sessions stored in Redis on the same EC2 instance as the node.js app
  • Clients are iPhones accessing the server via AFNetworking

Once again, none of the modules mentioned above are raising any errors.

Recommended answer

First of all, you need to be a bit more specific about timeouts.

  • TCP timeouts: TCP divides a message into packets, which are sent one by one. The receiver must acknowledge each packet. If it does not acknowledge a packet within a certain period of time, a TCP retransmission occurs, i.e. the same packet is sent again. If this happens a few more times, the sender gives up and kills the connection.

  • HTTP timeouts: An HTTP client such as a browser, or your server while acting as a client (e.g. sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it disconnects and calls it a timeout.
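As a rough illustration of the HTTP-level timeout described above, here is a minimal sketch of a Node.js client that sets its own timeout. The helper name getWithTimeout is hypothetical and not part of the original answer; it relies only on the core http module.

```javascript
const http = require('http');

// Hypothetical helper: GET a URL and reject if the socket stays idle
// for longer than timeoutMs.
function getWithTimeout(url, timeoutMs) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => {
        resolve({ status: res.statusCode, body: Buffer.concat(chunks).toString() });
      });
      res.on('error', reject);
    });
    // The 'timeout' event only reports inactivity. The request has to be
    // destroyed explicitly, otherwise the connection stays open.
    req.on('timeout', () => {
      req.destroy(new Error(`no response within ${timeoutMs} ms`));
    });
    req.on('error', reject);
  });
}
```

A caller would use it as getWithTimeout('http://127.0.0.1:3000/health', 2000), where the URL is a placeholder. The timeout value is arbitrary, as the answer says; the point is that the client decides when to give up.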

Now, there are many, many possible causes for this, from more trivial to less trivial:

  • Wrong Content-Length calculation: If you send a request with a Content-Length: 20 header, that means "I am going to send you 20 bytes". If you send 19, the other end will wait for the remaining byte. If that takes too long, you get a timeout.

  • Not enough infrastructure: Maybe you should assign more machines to your application. If (total load / # of CPU cores) is over 1, or your memory usage is high, your system may be over capacity. However, keep reading...

  • Silent exception: An error was thrown but not logged anywhere. The request never finished processing, which leads to the next item.

  • Resource leaks: Every request needs to be handled to completion. If you don't do this, the connection will remain open. In addition, the IncomingMessage object (usually called req in express code) will remain referenced by other objects (e.g. express itself). Each one of those objects can use a lot of memory.

  • Node event loop starvation: I will get to that at the end.
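Several of the causes above (wrong Content-Length, silent exceptions, requests never handled to completion) can be guarded against in one place. The following is a minimal sketch using only the core http module; sendText and safeHandler are hypothetical names, and an Express app would do the equivalent in middleware. Note that Express 4 does not forward a rejected promise from an async handler to its error middleware, so a wrapper of this kind is one way to avoid requests that hang silently.

```javascript
const http = require('http');

// Send a text body with a Content-Length measured in bytes, not characters.
// For non-ASCII text, body.length and the byte count differ.
function sendText(res, status, body) {
  res.writeHead(status, {
    'Content-Type': 'text/plain; charset=utf-8',
    'Content-Length': Buffer.byteLength(body, 'utf8'),
  });
  res.end(body);
}

// Wrap a (possibly async) handler so that every request is completed
// and every error is logged, even when the handler throws.
function safeHandler(handler) {
  return async (req, res) => {
    try {
      await handler(req, res);
    } catch (err) {
      console.error('request failed:', err);
      if (!res.headersSent) {
        sendText(res, 500, 'internal error');
      } else {
        // Headers already went out, so a clean error response is no longer
        // possible. Drop the connection instead of leaving the client waiting.
        res.destroy();
      }
    }
  };
}

// Example wiring (the route logic is a placeholder):
// http.createServer(safeHandler(async (req, res) => {
//   sendText(res, 200, 'hello');
// })).listen(3000);
```

A Content-Length that is too high stalls the peer as described in the first item; one that is too low truncates the body. Deriving it from Buffer.byteLength avoids both.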

For memory leaks, the symptom is that the node process uses a steadily increasing amount of memory.
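One simple way to watch for that symptom from inside the process is to log process.memoryUsage() at a fixed interval and look at the trend over hours. This is only a sketch; the helper names sampleMemory and startMemoryLog are hypothetical.

```javascript
// Hypothetical helper: take one memory sample, in megabytes.
function sampleMemory() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const toMb = (bytes) => Math.round((bytes / 1024 / 1024) * 10) / 10;
  return {
    time: new Date().toISOString(),
    rssMb: toMb(rss),            // total memory held by the process
    heapUsedMb: toMb(heapUsed),  // JavaScript objects currently alive
    heapTotalMb: toMb(heapTotal),
  };
}

// Log one sample every intervalMs. unref() stops this timer from keeping
// the process alive on its own.
function startMemoryLog(intervalMs, log = console.log) {
  const timer = setInterval(() => log(JSON.stringify(sampleMemory())), intervalMs);
  timer.unref();
  return timer;
}

// Example: startMemoryLog(60 * 1000);
```

A heapUsedMb value that keeps climbing across many samples, without dropping back after garbage collection, is consistent with a leak; a single high reading is not.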

To make things worse, if available memory is low and your server is misconfigured to use swapping, Linux will start moving memory to disk (swapping), which is very I/O and CPU intensive. Servers should not have swapping enabled.

cat /proc/sys/vm/swappiness

This command returns the swappiness level configured on your system (from 0 to 100). You can change it persistently via /etc/sysctl.conf (requires restart) or temporarily, until the next reboot, using: sysctl vm.swappiness=10

Once you've established that you have a memory leak, you need to get a core dump and download it for analysis. One way to do that is described in this other Stack Overflow answer: Tools to analyze core dump from Node.js

For connection leaks (you leak a connection by not handling a request to completion), the number of established connections to your server keeps growing. You can count established connections with: netstat -a -p tcp | grep ESTABLISHED | wc -l
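The same count can also be read from inside the application. As a sketch, server.getConnections() on the http.Server (which is what Express's app.listen() returns) reports how many sockets the server currently holds; the helper names below are hypothetical.

```javascript
const http = require('http');

// Promise wrapper around server.getConnections(), which reports the number
// of sockets the server currently has open.
function countConnections(server) {
  return new Promise((resolve, reject) => {
    server.getConnections((err, count) => (err ? reject(err) : resolve(count)));
  });
}

// Hypothetical monitor: log the connection count every intervalMs.
function watchConnections(server, intervalMs, log = console.log) {
  const timer = setInterval(async () => {
    try {
      log(`open connections: ${await countConnections(server)}`);
    } catch (err) {
      log(`getConnections failed: ${err.message}`);
    }
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return timer;
}

// Example: const server = app.listen(3000); watchConnections(server, 30000);
```

A count that only ever grows while traffic stays flat points toward requests that are never completed.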

Now, event loop starvation is the worst problem. Node works very well when your code is short-lived. But if you do CPU-intensive work and have a function that keeps the CPU busy for an excessive amount of time, say 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code taking 50 ms), the operations handled by the event loop, such as processing HTTP requests, start falling behind and eventually time out.
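To make the effect concrete, the sketch below measures how late a repeating timer fires while a synchronous busy loop hogs the CPU. Both helpers (watchEventLoopLag, blockFor) are hypothetical and exist only for illustration; the 100 ms block stands in for any CPU-heavy synchronous code in a request handler.

```javascript
// Hypothetical helper: report how late a repeating timer fires.
// A late timer means the event loop was busy with something else.
function watchEventLoopLag(intervalMs, onLag) {
  let expected = Date.now() + intervalMs;
  return setInterval(() => {
    const now = Date.now();
    onLag(Math.max(0, now - expected));
    expected = now + intervalMs;
  }, intervalMs);
}

// Synchronous busy loop standing in for CPU-heavy work.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

const lagTimer = watchEventLoopLag(10, (lag) => {
  if (lag > 20) console.log(`event loop lag: ${lag} ms`);
});

blockFor(100); // nothing else, including HTTP handling, can run during this

// Stop the demo shortly afterwards so the script exits.
setTimeout(() => clearInterval(lagTimer), 200);
```

Run as a script, this should report a lag of roughly the blocked duration once, and nothing afterwards. Newer Node versions also ship perf_hooks.monitorEventLoopDelay(), which collects the same kind of measurement as a histogram.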

The way to find a CPU bottleneck is to use a performance profiler. nodegrind/qcachegrind are my preferred profiling tools, but others prefer flamegraphs and such. However, it can be hard to run a profiler in production. Instead, take a development server and slam it with requests, i.e. run a load test. There are many tools for this.

Finally, another way to debug the problem is:

env NODE_DEBUG=tls,net node <...arguments for your app>

node has optional debug statements that are enabled through the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net makes node emit debugging information for the tls and net modules, so basically everything being sent or received. If there's a timeout, you will see where it's coming from.
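The same mechanism is available to application code through util.debuglog(), which returns a logger that stays silent unless its section name appears in NODE_DEBUG. The sketch below assumes a hypothetical section name, myapp, and a placeholder handleJob function.

```javascript
const util = require('util');

// 'myapp' is a hypothetical section name. The logger writes to stderr only
// when NODE_DEBUG contains it, e.g. NODE_DEBUG=myapp,net node server.js
const debug = util.debuglog('myapp');

// Placeholder for real request handling.
function handleJob(job) {
  debug('starting job %s', job.id);
  const result = job.input.toUpperCase();
  debug('finished job %s', job.id);
  return result;
}

console.log(handleJob({ id: 1, input: 'ping' }));
```

Leaving such statements in around database calls and response writes costs almost nothing when the variable is unset, and shows which step a stalled request last reached when it is set.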

Source: Years of experience maintaining large deployments of node services.
