Docker container refuses to get killed after run command turns into a zombie

Views: 4064 Published: 2017/6/10 20:07:59 Tags: linux docker zombie-process lxc

Problem description

First things first. My system info and versions:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 13.04
Release:    13.04
Codename:   raring

$ sudo docker version
Client version: 0.9.0
Go version (client): go1.2.1
Git commit (client): 2b3fdf2
Server version: 0.9.0
Git commit (server): 2b3fdf2
Go version (server): go1.2.1

$ lxc-version
lxc version: 0.9.0

$ uname -a
Linux ip-10-0-2-86 3.8.0-19-generic #29-Ubuntu SMP Wed Apr 17 18:16:28 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I am not able to stop a container after the process inside it becomes a zombie. After upgrading to Docker 0.9.0 I started seeing tons of zombies on my server. For example:

$ ps axo stat,ppid,pid,comm | grep -w defunct
Zl   25327 25332 node <defunct>

$ pstree -p
init(1)─┬
        ├─sh(819)───docker(831)─┬
                                ├─lxc-start(25327)───node(25332)───{node}(25378)
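
To see which parent is failing to reap, it can help to filter on the process state column rather than grepping for the word defunct. The following is only a minimal sketch: the function name list_zombies is made up for illustration, and it assumes the same ps column order used above (stat, ppid, pid, comm).

```shell
# list_zombies (hypothetical helper): read `ps -eo stat,ppid,pid,comm`
# style rows on stdin and print "PID PPID COMMAND" for every row whose
# state column starts with Z (zombie).
list_zombies() {
  awk '$1 ~ /^Z/ { print $3, $2, $4 }'
}

# Typical use on a live system:
#   ps -eo stat,ppid,pid,comm | list_zombies
```

Fed the ps row shown above, this prints 25332 25327 node, which points at lxc-start (25327) as the parent that has not yet collected the exit status.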

I can see that lxc-start(25327) is not calling wait() on the node process 25332, which keeps the zombie alive. So I checked what it was doing with strace, and it seemed to be stuck on an epoll_wait. strace actually gets stuck at first and just shows this:

$ sudo strace -ir -ttt -T -v -p 25327
Process 25327 attached - interrupt to quit (when asked to kill)
     0.000103 [    7fe59b9d34b3] epoll_wait(8,

But after I run sudo docker kill 3da5764b7bc9358 I get more output:

 0.000103 [    7fe59b9d34b3] epoll_wait(8, {{EPOLLIN, {u32=21673408, u64=21673408}}}, 10, 4294967295) = 1 <8.935002>
 8.935097 [    7fe59bcaff60] accept(4, 0, NULL) = 9 <0.000035>
 0.000095 [    7fe59bcafeb3] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 <0.000027>
 0.000083 [    7fe59b9d401a] setsockopt(9, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 <0.000027>
 0.000089 [    7fe59b9d347a] epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN, {u32=21673472, u64=21673472}}) = 0 <0.000023>
 0.000087 [    7fe59b9d34b3] epoll_wait(8, {{EPOLLIN, {u32=21673472, u64=21673472}}}, 10, 4294967295) = 1 <0.000026>
 0.000090 [    7fe59bcb0130] recvmsg(9, {msg_name(0)=NULL, msg_iov(1)=[{"\3\0\0\0\0\0\0\0", 8}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=773, uid=0, gid=0}}, msg_flags=0}, 0) = 8 <0.000034>
 0.000128 [    7fe59bcb019d] sendto(9, "\0\0\0\0\0\0\0\0\364b\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24, 0, NULL, 0) = 24 <0.000029>
 0.000090 [    7fe59b9d34b3] epoll_wait(8, {{EPOLLIN|EPOLLHUP, {u32=21673472, u64=21673472}}}, 10, 4294967295) = 1 <0.000018>
 0.000091 [    7fe59bcb0130] recvmsg(9, {msg_name(0)=NULL, msg_iov(1)=[{"\3\0\0\0\0\0\0\0", 8}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=0, uid=0, gid=0}}, msg_flags=0}, 0) = 0 <0.000026>
 0.000122 [    7fe59b9d347a] epoll_ctl(8, EPOLL_CTL_DEL, 9, NULL) = 0 <0.000037>
 0.000084 [    7fe59bcafd00] close(9) = 0 <0.000048>
 0.000103 [    7fe59b9d34b3] epoll_wait(8, {{EPOLLIN, {u32=21673408, u64=21673408}}}, 10, 4294967295) = 1 <1.091839>
 1.091916 [    7fe59bcaff60] accept(4, 0, NULL) = 9 <0.000035>
 0.000093 [    7fe59bcafeb3] fcntl(9, F_SETFD, FD_CLOEXEC) = 0 <0.000027>
 0.000083 [    7fe59b9d401a] setsockopt(9, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0 <0.000026>
 0.000090 [    7fe59b9d347a] epoll_ctl(8, EPOLL_CTL_ADD, 9, {EPOLLIN, {u32=21673504, u64=21673504}}) = 0 <0.000032>
 0.000100 [    7fe59b9d34b3] epoll_wait(8, {{EPOLLIN, {u32=21673504, u64=21673504}}}, 10, 4294967295) = 1 <0.000028>
 0.000088 [    7fe59bcb0130] recvmsg(9, {msg_name(0)=NULL, msg_iov(1)=[{"\3\0\0\0\0\0\0\0", 8}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=774, uid=0, gid=0}}, msg_flags=0}, 0) = 8 <0.000030>
 0.000125 [    7fe59bcb019d] sendto(9, "\0\0\0\0\0\0\0\0\364b\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 24, 0, NULL, 0) = 24 <0.000032>
 0.000119 [    7fe59b9d34b3] epoll_wait(8, {{EPOLLIN|EPOLLHUP, {u32=21673504, u64=21673504}}}, 10, 4294967295) = 1 <0.000071>
 0.000139 [    7fe59bcb0130] recvmsg(9, {msg_name(0)=NULL, msg_iov(1)=[{"\3\0\0\0\0\0\0\0", 8}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=0, uid=0, gid=0}}, msg_flags=0}, 0) = 0 <0.000018>
 0.000112 [    7fe59b9d347a] epoll_ctl(8, EPOLL_CTL_DEL, 9, NULL) = 0 <0.000028>
 0.000076 [    7fe59bcafd00] close(9) = 0 <0.000027>
 0.000096 [    7fe59b9d34b3] epoll_wait(8,

Then I looked at what epoll_wait was waiting on, which looks like file descriptor 8. I am guessing this from epoll_wait(8, {{EPOLLIN, {u32=21673408, u64=21673408}}}, 10, 4294967295) = 1 <8.935002>, which is of the form int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

$ cat /proc/25327/fdinfo/8
pos:    0
flags:  02000002
tfd:        7 events:       19 data:          14ab830
tfd:        4 events:       19 data:          14ab5c0

I am also adding fds 7 and 4, based on the tfd lines above (not sure what tfd really means):

$ cat /proc/25327/fdinfo/4
pos:    0
flags:  02000002
$ cat /proc/25327/fdinfo/7
pos:    0
flags:  02000002
sigmask:    fffffffe7ffbfab7
$ cd /proc/25327/fd
$ ls -al
lr-x------ 1 root root 64 Mar 13 22:28 0 -> /dev/null
lrwx------ 1 root root 64 Mar 13 22:28 1 -> /dev/pts/17
lrwx------ 1 root root 64 Mar 13 22:28 2 -> /dev/pts/17
l-wx------ 1 root root 64 Mar 13 22:28 3 -> /var/log/lxc/3da5764b7bc935896a72abc9371ce68d4d658d8c70b56e1090aacb631080ec0e.log
lrwx------ 1 root root 64 Mar 13 22:28 4 -> socket:[48415]
lrwx------ 1 root root 64 Mar 14 00:03 5 -> /dev/ptmx
lrwx------ 1 root root 64 Mar 14 00:03 6 -> /dev/pts/18
lrwx------ 1 root root 64 Mar 14 00:03 7 -> anon_inode:[signalfd]
lrwx------ 1 root root 64 Mar 14 00:03 8 -> anon_inode:[eventpoll]
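
According to the proc(5) manual page, each tfd line in the fdinfo of an epoll descriptor names one file descriptor registered with that epoll instance, and events is a hexadecimal mask of the events being watched. Under that reading, the listing can be decoded with a small sketch like the one below. Both helper names are hypothetical; the flag values are the standard ones from <sys/epoll.h>.

```shell
# decode_epoll_events (hypothetical helper): turn the hexadecimal
# "events" mask from an epoll fdinfo line into flag names.
# The body runs in a subshell so its variables do not leak.
decode_epoll_events() (
  mask=$((0x$1))
  names=""
  [ $((mask & 0x01)) -ne 0 ] && names="$names EPOLLIN"
  [ $((mask & 0x02)) -ne 0 ] && names="$names EPOLLPRI"
  [ $((mask & 0x04)) -ne 0 ] && names="$names EPOLLOUT"
  [ $((mask & 0x08)) -ne 0 ] && names="$names EPOLLERR"
  [ $((mask & 0x10)) -ne 0 ] && names="$names EPOLLHUP"
  echo "${names# }"
)

# epoll_targets (hypothetical helper): read /proc/PID/fdinfo/FD text on
# stdin and print "fd flags" for every descriptor the epoll fd watches.
epoll_targets() (
  while read -r key tfd ekey events rest; do
    [ "$key" = "tfd:" ] || continue
    echo "$tfd $(decode_epoll_events "$events")"
  done
)

# Typical use on a live system:
#   epoll_targets < /proc/25327/fdinfo/8
```

For the fdinfo shown above this prints 7 EPOLLIN EPOLLERR EPOLLHUP and 4 EPOLLIN EPOLLERR EPOLLHUP. Combined with the fd listing, that suggests lxc-start is waiting on its signalfd (7) and its command socket (4).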

Info about the socket:

$ sudo netstat -anp | grep 48415
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  2      [ ACC ]     STREAM     LISTENING     48415    25327/lxc-start     @/var/lib/lxc/3da5764b7bc935896a72abc9371ce68d4d658d8c70b56e1090aacb631080ec0e/command

There does seem to be a common pattern in docker.log: all containers that do not stop have this signature:

2014/03/16 16:33:15 Container beb71548b3b23ba3337ca30c6c2efcbfcaf19d4638cf3d5ec5b8a3e4c5f1059a failed to exit within 0 seconds of SIGTERM - using the force
2014/03/16 16:33:25 Container SIGKILL failed to exit within 10 seconds of lxc-kill beb71548b3b2 - trying direct SIGKILL
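
Since the signature is regular, the affected container IDs can be pulled out of the log mechanically. This is a rough sketch that assumes every relevant line has the same layout as the two lines above (date, time, the word Container, then the ID); stuck_containers is a hypothetical name.

```shell
# stuck_containers (hypothetical helper): read docker.log text on stdin
# and print the unique IDs of containers that did not exit on SIGTERM.
stuck_containers() {
  awk '$3 == "Container" && /failed to exit within [0-9]+ seconds of SIGTERM/ { print $4 }' |
    sort -u
}

# Typical use:
#   stuck_containers < docker.log
```

The follow-up lxc-kill line is deliberately not matched, because its fourth field is the word SIGKILL rather than a container ID.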

At this point I have no idea what to do next. Any suggestions on how I can find out what is causing these containers not to exit? Any other data I should collect? I also sent a SIGCHLD to this process, to no avail.

More data: I added a log line at the end of the node process that we start with the start command in the container:

Mon Mar 17 2014 20:52:52 GMT+0000 (UTC) process: main process = exit code: 0

And here are the logs from Docker:

2014/03/17 20:52:52 Container f8a3d55e0f... failed to exit within 0 seconds of SIGTERM - using the force
2014/03/17 20:53:02 Container SIGKILL failed to exit within 10 seconds of lxc-kill f8a3d55e0fd8 - trying direct SIGKILL

The timestamps show the process exited at 20:52:52.

This happens with both the native and lxc Docker drivers.

EDIT: REPRO STEPS!

Turn this into a bash script, run it, and watch almost 50% of the containers turn into zombies!

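CNT=0
while true
do
  echo "$CNT"
  # Start a detached container and remember its ID
  DOCK=$(sudo docker run -d -t anandkumarpatel/zombie_bug ./node index.js)
  # Stop that container in the background one minute from now
  sleep 60 && sudo docker stop "$DOCK" > out.log &
  sleep 1
  CNT=$((CNT+1))
  if [[ "$CNT" == "50" ]]; then
    exit
  fi
done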

Solution

Changing to the latest kernel fixes the issue.

Found the exact kernel difference:
REPRO: linux-image-3.8.0-31-generic
NO REPRO: linux-image-3.8.0-32-generic
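
To check other hosts for a build that reproduced the problem, the running kernel release can be compared against the first build that did not. This is a rough sketch that assumes GNU sort with -V (version sort) is available. kernel_older_than is a hypothetical helper, and treating 3.8.0-32 as the threshold is only an inference from the two builds listed above, so it says nothing about other kernel series.

```shell
# kernel_older_than RELEASE MINIMUM (hypothetical helper): succeed when
# RELEASE sorts strictly before MINIMUM in version order.
kernel_older_than() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use on a live system:
#   if kernel_older_than "$(uname -r)" 3.8.0-32-generic; then
#     echo "kernel predates the build that did not reproduce"
#   fi
```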

I think this is the fix:

+++ linux-3.8.0/kernel/pid_namespace.c
@@ -181,6 +181,7 @@
    int nr;
    int rc;
    struct task_struct *task, *me = current;
+   int init_pids = thread_group_leader(me) ? 1 : 2;

    /* Don't allow any more processes into the pid namespace */
    disable_pid_allocation(pid_ns);
@@ -230,7 +231,7 @@
     */
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
-       if (pid_ns->nr_hashed == 1)
+       if (pid_ns->nr_hashed == init_pids)
            break;
        schedule();
    }

This came from here: https://groups.google.com/forum/#!msg/fa.linux.kernel/u4b3n4oYDQ4/GuLrXfDIYggJ

I am going to upgrade all our servers that repro this and see if it still occurs.
