Timing out a forked process


Question

I am running a Monte Carlo simulation on multiple processors, but it hangs a lot. So I put together this Perl code to kill an iteration that hangs and move on to the next one. But I get some errors that I have not figured out yet. I think it sleeps too long, and it deletes the out.mt0 file before looking for it. This is the code:

my $pid = fork();
die "Could not fork\n" if not defined $pid;

if ($pid == 0) {
    print "In child\n";   
    system("hspice -i mont_read.sp -o out -mt 4"); wait;
    sleep(.8); wait;
    exit(0);
}

print "In parent \n";

$i = 0;    
$mont_number = $j - 1;

out: while (1) {
    $res = waitpid($pid, WNOHANG);    
    if ($res == -1) {
        print "Successful Exit Process Detected\n";
        system("mv out.mt0 mont_read.mt0"); wait;
        sleep(1); wait;
        system("perl monte_stat.pl > rel_out.txt"); wait ;
        system("cat stat_result.txt rel_out.txt > stat_result.tmp"); wait; 
        system("mv stat_result.tmp stat_result.txt"); wait;
        print "\nSim #$mont_number complete\n"; wait;
        last out;    
    }

    if ($res != -1) {    
        if ($i >= $timeout) {
            $hang_count = $hang_count+1;
            system("killall hspice"); wait;
            sleep(1);
            print("time_out complete\n"); wait;
            print "\nSim #$mont_number complete\n"; wait;
            last out; 
        }

        if ($i < $timeout) {
            sleep $slept; wait;
        }
        $i = $i+1;
    }
}

These are the errors:


Illegal division by zero at monte_stat.pl line 73,  line 2.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 1.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 1.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73.
mv: cannot stat `out.mt0': No such file or directory
mv: cannot stat `out.mt0': No such file or directory
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 3.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73,  line 1.
mv: cannot stat `out.mt0': No such file or directory

Could anyone give me an idea of where to look to debug this? Thanks.

Recommended answer

Judging by the errors, it appears that your hspice is crashing. But there are other issues.

First, here is a working example that stays as close as possible to your code.

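use warnings;
use strict;
use feature 'say';
use POSIX qw(:sys_wait_h);   # provides WNOHANG
$| = 1;                      # unbuffered output, so the dots show up as they are printed

# All times are in seconds; $duration is how long the stand-in job runs
my ($timeout, $duration, $sleep_time) = (5, 10, 1);

my $pid = fork // die "Can't fork: $!";

if ($pid == 0)
{
    # Child: replace this process with the job (a stand-in for the real command)
    exec "echo JOB STARTS; sleep $duration; echo JOB DONE";
    die "exec shouldn't return: $!";
}
say "Started $pid";
sleep 1;

my $tot_sec = 0;             # time spent polling so far
while (1)
{
    # Non-blocking check: returns the PID once the child has exited,
    # 0 while it is still running, -1 if there is no such child
    my $ret = waitpid $pid, WNOHANG;

    if    ($ret > 0) { say "Child $ret exited with: $?";  last; }
    elsif ($ret < 0) { say "\nNo such process ($ret)";    last; }
    else             { print " . " }

    sleep $sleep_time;

    # Give up once the job has been polled for longer than $timeout
    if (($tot_sec += $sleep_time) > $timeout) {
        say "\nTimeout. Send 15 (SIGTERM) signal to the process.";
        kill 15, $pid;
        last;
    }
}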

With $duration (of the job) set to 3, shorter than $timeout, we get


Started 16848
JOB STARTS
 .  .  . JOB DONE
Child 16848 exited with: 0

With $duration set to 10 we get


Started 16550
JOB STARTS
 .  .  .  .  .
Timeout. Send 15 (SIGTERM) signal to the process.

and the job is killed (wait for 5 more seconds; JOB DONE should not show up).

Comments on the code in the question

  • If you fork only to run a job, there is no reason for system. Just exec that program.

  • There is no need for wait after system, and it is wrong: system already includes a wait.

  • The wait does not belong after print and sleep either, and it is wrong there too.

  • There is no need to shell out to killall in order to kill a process.

  • If you end up using system, the program will run in a new process with another PID. Then more is needed to find that PID and kill it. See Proc::ProcessTable, for example.

  • The code above needs checks of whether the process was indeed killed.

Substitute your command line for echo ... and add checks for it as needed.
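The last few comments can be sketched in code. This is only a minimal sketch, not a definitive implementation: terminate_child and its grace period are hypothetical names introduced here for illustration, and sleep 30 stands in for the real job. It uses Perl's own kill instead of killall, then confirms through waitpid that the child is really gone, falling back to SIGKILL if SIGTERM is ignored.

```perl
use strict;
use warnings;
use feature 'say';
use POSIX qw(:sys_wait_h);

# Hypothetical helper (not part of the original answer): send SIGTERM, wait up
# to $grace seconds for the child to exit, then fall back to SIGKILL.
# Returns the child's wait status ($?) once the child has been reaped.
sub terminate_child {
    my ($pid, $grace) = @_;

    kill 'TERM', $pid;
    for (1 .. $grace) {
        my $ret = waitpid $pid, WNOHANG;
        return $? if $ret == $pid;   # child exited and was reaped
        return    if $ret == -1;     # no such child (already reaped elsewhere)
        sleep 1;
    }

    kill 'KILL', $pid;               # SIGKILL cannot be caught or ignored
    waitpid $pid, 0;                 # blocking wait, reaps the child
    return $?;
}

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    # Stand-in for the real job; the list form of exec avoids an extra shell,
    # so $pid is the PID of the job itself
    exec 'sleep', '30';
    die "exec failed: $!";
}

sleep 1;                             # give the job a moment to start
my $status = terminate_child($pid, 3);
say "Signal that ended the child: ", $status & 127;
```

The low seven bits of the wait status hold the number of the signal that ended the process, so the last line shows whether SIGTERM was enough or SIGKILL was needed.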

Another option is to simply sleep for a $timeout period and then check whether the job is done (the child has exited). However, with your approach you can do other things while polling.
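A minimal sketch of that simpler variant, assuming illustrative values for $timeout and $duration and using sleep as a stand-in for the real job:

```perl
use strict;
use warnings;
use feature 'say';
use POSIX qw(:sys_wait_h);

# Illustrative values, in seconds: the job is shorter than the timeout here
my ($timeout, $duration) = (3, 1);

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sleep', $duration;         # stand-in for the real job
    die "exec failed: $!";
}

sleep $timeout;                      # do nothing for the whole timeout period

# A single non-blocking check: returns $pid only if the child has already exited
my $ret      = waitpid $pid, WNOHANG;
my $finished = ($ret == $pid) ? 1 : 0;

if ($finished) {
    say "Job finished within $timeout s, wait status $?";
}
else {
    kill 'TERM', $pid;
    waitpid $pid, 0;                 # reap the child after terminating it
    say "Job timed out and was terminated";
}
```

The drawback is that the parent always waits for the full $timeout, even when the job finishes early.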

Yet another option is to use an alarm.
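One way the alarm approach might look is sketched below, again with assumed values for $timeout and $duration and with sleep standing in for the real job. The parent blocks in waitpid, and a SIGALRM handler that dies inside an eval breaks out of the wait when time is up.

```perl
use strict;
use warnings;
use feature 'say';

# Illustrative values, in seconds: the job runs longer than the timeout here
my ($timeout, $duration) = (2, 10);

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sleep', $duration;         # stand-in for the real job
    die "exec failed: $!";
}

my $timed_out = 0;
eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm $timeout;                  # deliver SIGALRM after $timeout seconds
    waitpid $pid, 0;                 # blocks until the child exits or the alarm fires
    alarm 0;                         # child finished in time, cancel the alarm
    1;
} or do {
    die $@ unless $@ eq "timeout\n"; # rethrow anything that is not our timeout
    $timed_out = 1;
    kill 'TERM', $pid;
    waitpid $pid, 0;                 # reap the terminated child
};

say $timed_out ? "Job timed out and was terminated"
               : "Job finished with wait status $?";
```

Unlike the polling loop, the parent cannot do other work while it waits, but it returns as soon as the child exits.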
