How should I clean up hung grandchild processes when an alarm trips in Perl?

Problem description

I have a parallelized automation script which needs to call many other scripts, some of which hang because they (incorrectly) wait for standard input or wait around for various other things that aren't going to happen. That's not a big deal because I catch those with alarm. The trick is to shut down those hung grandchild processes when the child shuts down. I thought various incantations of SIGCHLD, waiting, and process groups could do the trick, but they all block and the grandchildren aren't reaped.

My solution, which works, just doesn't seem like it is the right solution. I'm not especially interested in the Windows solution just yet, but I'll eventually need that too. Mine only works for Unix, which is fine for now.

I wrote a small script that takes the number of simultaneous parallel children to run and the total number of forks:

 $ fork_bomb <parallel jobs> <number of forks>

 $ fork_bomb 8 500

This will probably hit the per-user process limit within a couple of minutes. Many solutions I've found just tell you to increase the per-user process limit, but I need this to run about 300,000 times, so that isn't going to work. Similarly, suggestions to re-exec and so on to clear the process table aren't what I need. I'd like to actually fix the problem instead of slapping duct tape over it.

I crawl the process table looking for the child processes and shut down the hung processes individually in the SIGALRM handler, which needs to die because the rest of the real code has no hope of success after that. The kludgey crawl through the process table doesn't bother me from a performance perspective, but I wouldn't mind not doing it:

use Parallel::ForkManager;
use Proc::ProcessTable;

my $pm = Parallel::ForkManager->new( $ARGV[0] );

my $alarm_sub = sub {
        kill 9,
            map  { $_->{pid} }
            grep { $_->{ppid} == $$ }
            @{ Proc::ProcessTable->new->table }; 

        die "Alarm rang for $$!\n";
        };

foreach ( 0 .. $ARGV[1] ) 
    {
    print ".";
    print "\n" unless $count++ % 50;

    my $pid = $pm->start and next; 

    local $SIG{ALRM} = $alarm_sub;

    eval {
        alarm( 2 );
        system "$^X -le '<STDIN>'"; # this will hang
        alarm( 0 );
        };

    $pm->finish;
    }

If you want to run out of processes, take out the kill.

I thought that setting a process group would work so I could kill everything together, but that blocks:

my $alarm_sub = sub {
        kill 9, -$$;    # blocks here
        die "Alarm rang for $$!\n";
        };

foreach ( 0 .. $ARGV[1] ) 
    {
    print ".";
    print "\n" unless $count++ % 50;

    my $pid = $pm->start and next; 
    setpgrp(0, 0);

    local $SIG{ALRM} = $alarm_sub;

    eval {
        alarm( 2 );
        system "$^X -le '<STDIN>'"; # this will hang
        alarm( 0 );
        };

    $pm->finish;
    }

The same thing with POSIX's setsid didn't work either, and I think that actually broke things in a different way since I'm not really daemonizing this.

Curiously, Parallel::ForkManager's run_on_finish happens too late for the same clean-up code: the grandchildren are apparently already disassociated from the child processes at that point.

Recommended answer

I've read the question a few times, and I think I sort of get what you are trying to do. You have a control script. This script spawns children to do some stuff, and these children spawn the grandchildren to actually do the work. The problem is that the grandchildren can be too slow (waiting for STDIN, or whatever), and you want to kill them. Furthermore, if there is one slow grandchild, you want the entire child to die (killing the other grandchildren, if possible).

So, I tried implementing this two ways. The first was to make the parent spawn a child in a new UNIX session, set a timer for a few seconds, and kill the entire child session when the timer went off. This made the parent responsible for both the child and the grandchildren. It also didn't work right.
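
The code for that first attempt isn't shown here. For reference, the underlying "kill everything at once" mechanism can be sketched with a process group instead of a session, using only core modules. This is a minimal sketch, not the code I tried: the helper name `run_group_with_timeout` is hypothetical, and it polls with `waitpid` rather than using an event loop.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);
use Time::HiRes qw(sleep time);

# Hypothetical helper: run $code in a child that leads its own process group,
# and SIGKILL the whole group if it has not exited after $timeout seconds.
# Returns the raw wait status ($?) of the child.
sub run_group_with_timeout {
    my ($timeout, $code) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        setpgrp(0, 0);          # child becomes leader of a new process group
        $code->();              # anything it spawns inherits that group
        exit 0;
    }

    setpgrp($pid, $pid);        # set it from the parent too, to avoid a race

    my $deadline = time() + $timeout;
    while (time() < $deadline) {
        return $? if waitpid($pid, WNOHANG) == $pid;
        sleep(0.05);
    }

    kill 'KILL', -$pid;         # negative PID: signal every process in the group
    waitpid($pid, 0);           # reap the child so it does not linger as a zombie
    return $?;
}

# The child hangs inside system(); the child and its grandchild die together.
my $status = run_group_with_timeout(2, sub { system "$^X -e 'sleep 60'" });
printf "child ended by signal %d\n", $status & 127;
```

Because the signal comes from outside the group, the process doing the killing is not hit by its own `kill`, which is the difference from calling `kill 9, -$$` inside the child's alarm handler.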

The next strategy was to make the parent spawn the child, and then make the child responsible for managing the grandchildren. It would set a timer for each grandchild, and kill it if the process hadn't exited by expiration time. This works great, so here is the code.

We'll use EV to manage the children and timers, and AnyEvent for the API. (You can try another AnyEvent event loop, like Event or POE. But I know that EV correctly handles the condition where a child exits before you tell the loop to monitor it, which eliminates annoying race conditions that other loops are vulnerable to.)

#!/usr/bin/env perl

use strict;
use warnings;
use feature ':5.10';

use AnyEvent;
use EV; # you need EV for the best child-handling abilities

We need to keep track of the child watchers:

# active child watchers
my %children;

Then we need to write a function to start the children. The things the parent spawns are called children, and the things the children spawn are called jobs.

sub start_child($$@) {
    my ($on_success, $on_error, @jobs) = @_;

The arguments are a callback to be called when the child completes successfully (meaning its jobs were also a success), a callback when the child did not complete successfully, and then a list of coderef jobs to run.

In this function, we need to fork. In the parent, we set up a child watcher to monitor the child:

    if(my $pid = fork){ # parent
        # monitor the child process, inform our callback of error or success
        say "$$: Starting child process $pid";
        $children{$pid} = AnyEvent->child( pid => $pid, cb => sub {
            my ($pid, $status) = @_;
            delete $children{$pid};

            say "$$: Child $pid exited with status $status";
            if($status == 0){
                $on_success->($pid);
            }
            else {
                $on_error->($pid);
            }
        });
    }

In the child, we actually run the jobs. This involves a little bit of setup, though.

First, we forget the parent's child watchers, because it doesn't make sense for the child to be informed of its siblings exiting. (Fork is fun, because you inherit all of the parent's state, even when that makes no sense at all.)

    else { # child
        # kill the inherited child watchers
        %children = ();
        my %timers;

We also need to know when all the jobs are done, and whether or not they were all a success. We use a counting condition variable (condvar) to determine when everything has exited. We increment on startup, and decrement on exit, and when the count is 0, we know everything's done.

I also keep a boolean around to indicate error state. If a process exits with a non-zero status, error goes to 1. Otherwise, it stays 0. You might want to keep more state than this :)

        # then start the kids
        my $done = AnyEvent->condvar;
        my $error = 0;

        $done->begin;

(We also start the count at 1 so that if there are 0 jobs, our process still exits.)

Now we need to fork for each job, and run the job. In the parent, we do a few things. We increment the condvar. We set a timer to kill the child if it's too slow. And we set up a child watcher, so we can be informed of the job's exit status.

        for my $job (@jobs) {
            if(my $pid = fork){
                say "[c] $$: starting job $job in $pid";
                $done->begin;

                # this is the timer that will kill the slow children
                $timers{$pid} = AnyEvent->timer( after => 3, interval => 0, cb => sub {
                    delete $timers{$pid};

                    say "[c] $$: Killing $pid: too slow";
                    kill 9, $pid;
                });

                # this monitors the children and cancels the timer if
                # it exits soon enough
                $children{$pid} = AnyEvent->child( pid => $pid, cb => sub {
                    my ($pid, $status) = @_;
                    delete $timers{$pid};
                    delete $children{$pid};

                    say "[c] [j] $$: job $pid exited with status $status";
                    $error ||= ($status != 0);
                    $done->end;
                });
            }

Using the timer is a little bit easier than alarm, since it carries state with it. Each timer knows which process to kill, and it's easy to cancel the timer when the process exits successfully -- we just delete it from the hash.

That's the parent (of the child). The child (of the child; or the job) is really simple:

            else {
                # run kid
                $job->();
                exit 0; # just in case
            }

You could also close stdin here, if you wanted to.
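
One way to do that, sketched on its own rather than inside the event-loop code, is to reopen STDIN from /dev/null in the forked job, so any read sees end-of-file immediately instead of blocking. The helper name `run_job_without_stdin` is hypothetical, and the sketch assumes a Unix-like system where /dev/null exists.

```perl
use strict;
use warnings;

# Hypothetical helper: fork, detach the child from the inherited stdin,
# run the job, and return the child's exit code.
sub run_job_without_stdin {
    my ($job) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Reads from /dev/null return EOF at once instead of waiting for input.
        open STDIN, '<', '/dev/null' or die "cannot reopen STDIN: $!";
        $job->();
        exit 0;
    }

    waitpid($pid, 0);
    return $? >> 8;
}

# A job that would hang on a terminal now sees EOF and exits right away.
my $code = run_job_without_stdin(sub {
    my $line = <STDIN>;
    exit(defined $line ? 1 : 0);
});
print "job exit code: $code\n";
```

This only helps with jobs that hang on standard input; a job that waits on something else still needs the timer.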

Now, after all the processes have been spawned, we wait for them to all exit by waiting on the condvar. The event loop will monitor the children and timers, and do the right thing for us:

        } # this is the end of the for @jobs loop
        $done->end;

        # block until all children have exited
        $done->recv;

Then, when all the children have exited, we can do whatever cleanup work we want, like:

        if($error){
            say "[c] $$: One of your children died.";
            exit 1;
        }
        else {
            say "[c] $$: All jobs completed successfully.";
            exit 0;
        }
    } # end of "else { # child"
} # end of start_child
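
The child above collapses every job's status into a single boolean. If you want to keep more state, the status passed to an AnyEvent child watcher is the raw wait status (the same value as `$?` after `waitpid`), so it can be split into an exit code and a signal number. A minimal sketch, with a hypothetical helper name `decode_status`:

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw wait status into its parts.
sub decode_status {
    my ($status) = @_;
    my %info = (
        exit_code => $status >> 8,              # value passed to exit()
        signal    => $status & 127,             # signal that ended the process, or 0
        core_dump => ($status & 128) ? 1 : 0,   # true if a core file was written
    );
    return \%info;
}

my $killed = decode_status(9);         # a job killed by the timer with SIGKILL
my $failed = decode_status(3 << 8);    # a job that called exit 3
printf "signal=%d exit=%d\n", $killed->{signal}, $failed->{exit_code};
```

With this, the child could tell a job that timed out (signal 9) apart from one that failed on its own (non-zero exit code).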

OK, so that's the child and grandchild/job. Now we just need to write the parent, which is a lot easier.

Like the child, we are going to use a counting condvar to wait for our children.

# main program
my $all_done = AnyEvent->condvar;

We need some jobs to do. Here's one that is always successful, and one that will be successful if you press return, but will fail if you just let it be killed by the timer:

my $good_grandchild = sub {
    exit 0;
};

my $bad_grandchild = sub {
    my $line = <STDIN>;
    exit 0;
};

So then we just need to start the child jobs. If you remember way back to the top of start_child, it takes two callbacks: a success callback and an error callback. We'll set those up; the error callback will print "not ok" and decrement the condvar, and the success callback will print "ok" and do the same. Very simple.

my $ok  = sub { $all_done->end; say "$$: $_[0] ok" };
my $nok = sub { $all_done->end; say "$$: $_[0] not ok" };

Then we can start a bunch of children with even more grandchildren jobs:

say "starting...";

$all_done->begin for 1..4;
start_child $ok, $nok, ($good_grandchild, $good_grandchild, $good_grandchild);
start_child $ok, $nok, ($good_grandchild, $good_grandchild, $bad_grandchild);
start_child $ok, $nok, ($bad_grandchild, $bad_grandchild, $bad_grandchild);
start_child $ok, $nok, ($good_grandchild, $good_grandchild, $good_grandchild, $good_grandchild);

Two of those will time out, and two will succeed. If you press enter while they're running, though, then they might all succeed.

Anyway, once those have started, we just need to wait for them to finish:

$all_done->recv;

say "...done";

exit 0;

And that's the program.

One thing that we aren't doing that Parallel::ForkManager does is "rate limiting" our forks so that only n children are running at a time. This is pretty easy to manually implement, though:

 use Coro;
 use AnyEvent::Subprocess; # better abstraction than manually
                           # forking and making watchers
 use Coro::Semaphore;

 my $job = AnyEvent::Subprocess->new(
    on_completion => sub {}, # replace later
    code          => sub { ... }, # the child process goes here
 );

 my $rate_limit = Coro::Semaphore->new(3); # 3 procs at a time

 my @coros = map { async {
     my $guard = $rate_limit->guard;
     $job->clone( on_completion => Coro::rouse_cb )->run($_);
     Coro::rouse_wait;
 }} ({ args => 'for first job' }, { args => 'for second job' }, ... );

 # this waits for all jobs to complete
 my @results = map { $_->join } @coros;

The advantage here is that you can do other things while your children are running -- just spawn more threads with async before you do the blocking join. You also have a lot more control over the children with AnyEvent::Subprocess -- you can run the child in a Pty and feed it stdin (like with Expect), and you can capture its stdin and stdout and stderr, or you can ignore those things, or whatever. You get to decide, not some module author that's trying to make things "simple".
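
If you don't want the Coro dependency, the same cap on concurrency can be approximated with plain fork and wait. This is only a sketch under that assumption: it blocks while waiting (so you lose the "do other things meanwhile" advantage), and the helper name `run_limited` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: run each job in its own forked process, keeping at most
# $max children alive at once. Returns a hashref of pid => raw wait status.
sub run_limited {
    my ($max, @jobs) = @_;
    my (%running, %status);

    my $reap_one = sub {
        my $pid = wait();               # blocks until any child exits
        if ($pid == -1) {               # no children left to wait for
            %running = ();
            return;
        }
        $status{$pid} = $?;
        delete $running{$pid};
    };

    for my $job (@jobs) {
        # At the limit: wait for a slot to free up before forking again.
        $reap_one->() while keys(%running) >= $max;

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            $job->();
            exit 0;
        }
        $running{$pid} = 1;
    }

    $reap_one->() while %running;       # drain the remaining children
    return \%status;
}

# Six jobs, never more than three running at a time.
my $status = run_limited(3, map { my $n = $_; sub { exit $n } } 1 .. 6);
printf "%d jobs finished\n", scalar keys %$status;
```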

Anyway, hope this helps.
