为什么在负载测试期间MongoDB没有响应? [英] Why is MongoDB unresponsive during a load test?

查看:70
本文介绍了为什么在负载测试期间MongoDB没有响应?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我遇到MongoDB变得无响应并拒绝生产中的连接的问题.

I have an issue with MongoDB becoming unresponsive and refusing connections in production.

在测试中,我可以通过向我的应用发送大量请求来始终如一地重现锁定".

In testing, I'm able to consistently reproduce the "lock out" by sending lot of requests to my app.

这是mongostat的示例输出:

Here is an example output from mongostat:

     0      0      0      0       0       1       0   1.3g   1.8g   177m      0        0          0       0|0     0|0    62b     1k    18   19:40:11
     0      0      0      0       0       1       0   1.3g   1.8g   177m      0        0          0       0|0     0|0    62b     1k    18   19:40:12
--------------LOAD TEST BEGINS-----------------------   
     1     56      0      0       1      58       0   1.3g  1.81g   177m      0        0          0       0|0     0|0    10k   302k    18   19:40:13
    10    116      0      0       2     127       0   1.3g  1.81g   177m      0        0          0       0|0     1|0    32k   263k    19   19:40:14
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   206    431      0      0       2     638       0   1.3g  1.82g   177m      0      1.1          0       0|0     0|0   206k   339k    22   19:40:15
   331    649      0      0       2     983       0   1.3g  1.81g   177m      0        1          0       0|0     0|0   317k   521k    21   19:40:16
   367    723      0      0      10    1090       0   1.3g  1.81g   177m      0      0.9          0       0|1     1|1   297k     1m    20   19:40:17
   500    961      0      0      11    1461       0   1.3g  1.81g   177m      0      1.2          0       0|0     0|0   390k     1m    19   19:40:18
   543   1073      0      0      10    1618       0   1.3g  1.82g   177m      0      1.2          0       0|0     2|0   433k     1m    21   19:40:19
   461    909      0      0      11    1370       0   1.3g  1.81g   177m      0      1.1          0       0|0     2|0   365k     1m    20   19:40:20
   347    685      0      0       6    1033       0   1.3g  1.81g   177m      0      0.8          0       0|0     1|0   411k   962k    21   19:40:21
   482    938      0      0       9    1421       0   1.3g  1.81g   177m      0      1.1          0       0|0     0|0   492k     1m    20   19:40:22
   444    877      0      0      10    1322       0   1.3g  1.81g   177m      0      1.1          0       0|0     0|1   479k     1m    22   19:40:23
   528   1019      0      0       9    1549       0   1.3g  1.81g   177m      0      1.2          0       0|0     0|0   446k     1m    21   19:40:24
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   520   1033      0      0      10    1554       0   1.3g  1.81g   177m      0      1.2          0       0|0     1|0   418k     1m    22   19:40:25
   448    878      0      0      10    1326       0   1.3g  1.81g   177m      0      1.2          0       0|0     1|0   349k     1m    22   19:40:26
   503    979      0      0       8    1483       0   1.3g  1.82g   177m      0      1.3          0       0|0     1|0   397k     1m    21   19:40:27
   379    743      0      0      10    1123       0   1.3g  1.81g   177m      0        1          0       0|0     0|0   315k     1m    22   19:40:28
   456    896      0      0       7    1354       0   1.3g  1.81g   177m      0      1.1          0       0|0     0|0   436k     1m    22   19:40:29
   371    715      0      0      10    1086       0   1.3g  1.81g   175m      0      0.9          0       0|0     0|0   299k     1m    20   19:40:30
   505    993      0      0       7    1499       0   1.3g  1.81g   175m      0      1.2          0       0|0     2|0   394k     1m    21   19:40:31
   493   1023      0      0      10    1519       0   1.3g  1.82g   175m      0      1.1          0       0|0     1|0   392k     1m    22   19:40:32
   426    868      0      0      11    1295       0   1.3g  1.81g   170m      0        1          0       0|0     1|0   363k     1m    21   19:40:33
   352    772      0      0       7    1125       0   1.3g  1.81g   163m     19      4.4          0       0|1     0|1   368k     1m    22   19:40:34
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   330    656      0      0       5     985       0   1.3g  1.81g   159m      4      1.5          0       0|0     0|0   325k   876k    19   19:40:35
   398    817      0      0       8    1218       0   1.3g  1.81g   155m      9      1.3          0       0|0     1|0   372k     1m    21   19:40:36
   464   1015      0      0       8    1479       0   1.3g  1.81g   155m      2      1.2          0       0|0     1|0   381k     1m    21   19:40:37
   457    915      0      0      11    1372       0   1.3g  1.81g   155m      0      1.1          0       0|0     0|0   356k     1m    22   19:40:38
   458    960      0      0       7    1419       0   1.3g  1.83g   155m      0      1.1          0       0|0     0|0   376k     1m    22   19:40:39
   391    760      0      0      11    1153       0   1.3g  1.81g   155m      0        1          0       0|0     1|0   321k     1m    21   19:40:40
   315    639      0      0       5     956       0   1.3g  1.81g   155m      0      0.8          0       0|0     0|0   325k   877k    21   19:40:41
   303    696      0      0       7     998       0   1.3g  1.81g   155m      0      0.8          0       0|0     0|0   279k     1m    21   19:40:42
   388    772      0      0       8    1163       0   1.3g  1.81g   155m      0        1          0       0|0     0|0   308k     1m    20   19:40:43
   273    623      0      0       7     895       0   1.3g  1.81g   155m      0      0.7          0       0|0     2|0   232k     1m    20   19:40:44
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   397    796      0      0       7    1194       0   1.3g  1.81g   155m      0      0.9          0       0|0     0|0   311k     1m    19   19:40:45
   347    754      0      0       4    1103       0   1.3g  1.81g   155m      0      0.9          0       0|0     0|0   329k   771k    20   19:40:46
   336    700      0      0       8    1042       0   1.3g  1.82g   160m     20      0.8          0       2|2     3|2   370k     1m    23   19:40:47
   272    525      0      0       5     805       0   1.3g  1.79g   167m    188      0.6          0       0|0     0|0   306k   818k    18   19:40:48
   103    246      0      0       1     350       0   1.3g  1.78g   166m      0      0.4          0       0|0     0|0   136k   234k    15   19:40:49
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:50
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:51
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:52
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:53
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:54
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:55
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:56
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:57
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:58
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:59
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:00
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:01
     0      3      0      0       0       4       0   1.3g  1.78g   166m      0        0          0       0|0     0|0   775b     4k    15   19:41:02
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:03
     0      0      0      0       0       1       1   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:04
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:05
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:06
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:07
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:08
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:09
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:10
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:11
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:12
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:13
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:14
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:15
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:16
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:17
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:18
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:19
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:20
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:21
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:22
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:23
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:24
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:25
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:26
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:27
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:28
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:29
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:30
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:31
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:32
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:33
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:34
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:35
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:36
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:37
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:38
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:39
     0     28      0      0       0      29       0   1.3g  1.78g   166m      0        0          0       0|0     0|0     5k    27k    15   19:41:40
    86    234      0      0       1     322       0   1.3g  1.79g   166m      0      0.2          0       0|0     1|0   145k   217k    17   19:41:41
   216    445      0      0       5     663       0   1.3g  1.79g   167m      0      0.5          0       0|0     0|0   220k   807k    16   19:41:42
   302    612      0      0       6     916       0   1.3g  1.79g   167m      0      0.6          0       0|0     0|0   236k   950k    16   19:41:43
   113    247      0      0       1     362       0   1.3g  1.79g   167m      0      0.2          0       0|0     0|0    95k   192k    17   19:41:44
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    38     63      0      0       2     103       0   1.3g  1.79g   167m      0        0          0       0|0     0|0    26k   289k    16   19:41:45
--------------LOAD TEST ENDS----------------------- 
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:46
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:47
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:48
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:49

如您所见,在负载测试的中间,读写几乎没有任何下降.在这段时间内,我无法通过mongo或mongostat在本地连接到MongoDB(连接被拒绝).您还可以看到,连接数非常低,mongo的最大连接数为16000.

As you can see, in the middle of the load test, the reads and writes drop to pretty much nothing. During that time, I'm unable to connect locally to MongoDB via mongo or mongostat (connection refused). As you can also see, the connections is pretty low and there are 16000 connections in the max connections for mongo.

我能够始终如一地重现这一点.但是,每次锁定"都是不同的长度.我观察到51秒,20秒,28秒.在每种情况下,我仍然通过mongostat连接到mongo(并接收输出),但是新连接将被拒绝.

I'm able to reproduce this consistently. However, each time the "lock out" is a different length. I've observed 51 seconds, 20 seconds, 28 seconds. In each instance, I'm still connected to mongo via mongostat (and receiving output), but new connections will be refused.

user@myhostname:~# mongostat
couldn't connect to [127.0.0.1] couldn't connect to server 127.0.0.1

可能是什么问题?我正在运行MongoDB v1.8.2.

What could be the issue? I'm running MongoDB v1.8.2.

顺便说一句,我能够重现类似的问题,其中命令不会完全消失,但数量很少且波动很大.

As an aside, I'm able to reproduce a similar issue, where the commands don't drop off entirely, but are low in volume and fluctuate wildly.

推荐答案

已解决:

sudo sysctl net.ipv4.tcp_tw_reuse=1

然后,重新启动mongo.

Then, restart mongo.

或者,您可以将其添加到/etc/sysctl.conf中(这样它将在重新启动时运行):

Alternatively, you can add it to /etc/sysctl.conf (so it will be run on reboot):

net.ipv4.tcp_tw_reuse=1

然后运行此程序以重新加载(无需重新启动)

Then run this to reload (without having to reboot)

sudo sysctl -p /etc/sysctl.conf

此修复"将禁用TCP套接字(服务器范围)的timewait状态.因此,这根本不是一个解决办法.但是,直到mongo使用SO_LINGER减少其timewait状态之前,大量服务器套接字将以TIME_WAIT状态集中,并且对于新连接仍然不可用.您可以通过以下方式在TIME_WAIT中查看连接计数:

This "fix" will disable the timewait state for TCP sockets (server-wide). So, it's really not a fix at all. However, until mongo reduces their timewait state using SO_LINGER, a large number of server sockets will pool up in the TIME_WAIT state and remain unusable for new connections. You can see the count of connections in TIME_WAIT with this:

netstat -an | grep TIME_WAIT | wc -l

这样,我能够看到它在大约28k TIME_WAIT连接失败.使用此内核标志:

With that, I was able to see it fail at around 28k TIME_WAIT connections. Using this kernel flag:

sysctl net.ipv4.ip_local_port_range="18000 65535"

服务器在45k连接时失败.因此,为了更轻松地重现错误,您可以将范围降低到200或小一些.

The server fails at 45k connections. So, to reproduce the error more easily, you could lower the range to 200 or something small.

所以,这个的结果毕竟是一个编程问题(如您在最后一个链接中看到的那样):

So, the result of this was a programming question after all (as you can see from the last link):

TCP选项SO_LINGER(零)-当它必需

http://alas.matf.bg.ac .rs/manuals/lspe/snode = 105.html

这篇关于为什么在负载测试期间MongoDB没有响应?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆