Process stuck in exit, shows as zombie but cannot be reaped

Question

I have a process that's monitored by its parent. The child encountered an error that caused it to call abort. The process does not tamper with the abort process, so it should proceed as expected (dump core, terminate). The parent is supposed to detect the child's termination and trigger a series of events to respond to the failure. The child is multi-threaded and complex.

Here's what I see from ps:

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0  1000  4929  1272  20   0  85440  6792 wait   S+   pts/2      0:00 rxd
1  1000  4930  4929  20   0      0     0 exit   Zl+  pts/2     38:21 [rxd] <defunct>

So the child (4930) has terminated. It is a zombie. I cannot attach to it, as expected. However, the parent (4929) stays blocked in:

int i;
// ...
waitpid (-1, &i, 0);

So it seems the child is a zombie but somehow has not completed everything necessary for its parent to reap it. The WCHAN value of exit is, I think, a valuable clue.

The platform is 64-bit Linux, Ubuntu 13.04, kernel 3.8.0-30. The child doesn't appear to be dumping core or doing anything. I've left the system for several minutes and nothing changed.

Does anyone have any ideas what might be causing this or what I can do about it?

Update: Another interesting bit of information -- if I kill -9 the parent process, the child goes away. This is kind of baffling, since the parent process is trivial, just blocking in waitpid. Also, I don't get any core dump (from the child) when this problem happens.

Update: It seems the child is stuck in schedule, called from exit_mm, called from do_exit. I wonder why exit_mm would call schedule. And I wonder why killing the parent would unstick it.

Recommended answer

I finally figured it out! The process was actually doing useful work all this time. The process held the last reference to a large file on a slow filesystem. When the process terminated, the last reference to the file was released, forcing the OS to reclaim the space. The file was so large that this required tens of thousands of I/O operations, taking 10 minutes or more.
