Timing out a forked process
Question
I am running a Monte Carlo simulation on multiple processors, but it hangs a lot. So I put together this Perl code to kill the iteration that hangs and move on to the next one. But I get some errors that I have not figured out yet. I think it sleeps too long and deletes the out.mt0 file before looking for it. This is the code:
my $pid = fork();
die "Could not fork\n" if not defined $pid;
if ($pid == 0) {
    print "In child\n";
    system("hspice -i mont_read.sp -o out -mt 4"); wait;
    sleep(.8); wait;
    exit(0);
}
print "In parent \n";
$i = 0;
$mont_number = $j - 1;
out: while (1) {
    $res = waitpid($pid, WNOHANG);
    if ($res == -1) {
        print "Successful Exit Process Detected\n";
        system("mv out.mt0 mont_read.mt0"); wait;
        sleep(1); wait;
        system("perl monte_stat.pl > rel_out.txt"); wait ;
        system("cat stat_result.txt rel_out.txt > stat_result.tmp"); wait;
        system("mv stat_result.tmp stat_result.txt"); wait;
        print "\nSim #$mont_number complete\n"; wait;
        last out;
    }
    if ($res != -1) {
        if ($i >= $timeout) {
            $hang_count = $hang_count+1;
            system("killall hspice"); wait;
            sleep(1);
            print("time_out complete\n"); wait;
            print "\nSim #$mont_number complete\n"; wait;
            last out;
        }
        if ($i < $timeout) {
            sleep $slept; wait;
        }
        $i = $i+1;
    }
}
These are the errors:
Illegal division by zero at monte_stat.pl line 73, line 2.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73, line 1.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73, line 1.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73.
mv: cannot stat `out.mt0': No such file or directory
mv: cannot stat `out.mt0': No such file or directory
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73, line 3.
mv: cannot stat `out.mt0': No such file or directory
Illegal division by zero at monte_stat.pl line 73, line 1.
mv: cannot stat `out.mt0': No such file or directory
Could anyone give me an idea of where to look to debug it? Thanks.
Recommended answer
According to the errors, it appears that your hspice is crashing. But there are other issues.
Here is first a working example, as close as possible to your code.
use warnings;
use strict;
use feature 'say';
use POSIX qw(:sys_wait_h);   # provides WNOHANG

$| = 1;   # unbuffered output, so the progress dots show up right away

my ($timeout, $duration, $sleep_time) = (5, 10, 1);

my $pid = fork // die "Can't fork: $!";

if ($pid == 0)
{
    # Child: replace this process with the job
    exec "echo JOB STARTS; sleep $duration; echo JOB DONE";
    die "exec shouldn't return: $!";
}

say "Started $pid";
sleep 1;

my $tot_sec = 0;
while (1)
{
    my $ret = waitpid $pid, WNOHANG;   # non-blocking check on the child

    if    ($ret > 0) { say "Child $ret exited with: $?"; last; }
    elsif ($ret < 0) { say "\nNo such process ($ret)"; last; }
    else             { print " . " }

    sleep $sleep_time;

    if (($tot_sec += $sleep_time) > $timeout) {
        say "\nTimeout. Send 15 (SIGTERM) signal to the process.";
        kill 15, $pid;
        last;
    }
}
With $duration (of the job) set to 3, shorter than $timeout, we get
Started 16848
JOB STARTS
. . . JOB DONE
Child (JOB) 16848 exited with: 0
With $duration set to 10, we get
Started 16550
JOB STARTS
. . . . .
Timeout. Send 15 (SIGTERM) signal to the process.
and the job is killed (wait for 5 more seconds; the JOB DONE shouldn't show up).
Comments on the code in the question
If you fork only to run a job, there is no reason for system. Just exec that program.
There is no need for wait after system, and it's wrong. The system call itself waits for the command to finish before it returns.
The wait doesn't belong after print and sleep either, and it's wrong there too.
There is no need to shell out to killall in order to kill a process; Perl's builtin kill sends a signal to a given PID.
If you end up using system, the program will run in a new process with another PID. Then more is needed to find that PID and kill it. See Proc::ProcessTable and this post, for example.
The code above needs checks of whether the process was indeed killed.
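As a rough sketch of what such a check might look like: the helper below sends SIGTERM, polls with waitpid to see whether the child is really gone, and escalates to SIGKILL if it is not. The name stop_child and its grace-period parameter are made up for illustration, and `sleep 30` merely stands in for a job that hangs.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

# Hypothetical helper: ask $pid to stop, then confirm that it is gone.
# Returns the name of the signal after which the child was reaped
# ('TERM' or 'KILL'), or nothing if the child could not be stopped.
sub stop_child {
    my ($pid, $grace) = @_;              # $grace: seconds to wait per signal
    for my $sig ('TERM', 'KILL') {
        kill $sig, $pid or return;       # no such process, or no permission
        for (1 .. $grace) {
            my $ret = waitpid $pid, WNOHANG;
            return $sig if $ret == $pid; # child reaped; $? holds its status
            return      if $ret < 0;     # not our child (already reaped?)
            sleep 1;
        }
    }
    return;
}

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec 'sleep', '30' or die "exec failed: $!";   # stand-in for a hung job
}

my $how = stop_child($pid, 3);
print defined $how ? "Child $pid stopped by SIG$how\n"
                   : "Could not stop $pid\n";
```

Because exec is given a list here, no shell is involved and $pid is the PID of the job itself, which is what makes a plain kill sufficient.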
Substitute your command line for echo ... and add checks for it as needed.
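For instance, with the hspice command line from the question, the checks might look roughly like the sketch below. It assumes hspice writes out.mt0 only on a successful run (an assumption, not something the question confirms), and describe_status is a made-up helper name. A blocking waitpid is used only to keep the sketch short; the polling loop with the timeout would go in its place.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a wait status (as found in $?) into words.
sub describe_status {
    my ($status) = @_;
    return "killed by signal " . ($status & 127) if $status & 127;
    return "exited with code " . ($status >> 8);
}

# The command from the question, as a list: exec then runs it without
# a shell, so $pid below is the PID of hspice itself.
my @cmd = ('hspice', '-i', 'mont_read.sp', '-o', 'out', '-mt', '4');

my $pid = fork // die "Can't fork: $!";
if ($pid == 0) {
    exec @cmd or die "Can't exec $cmd[0]: $!";
}

waitpid $pid, 0;      # blocking, for brevity only
my $status = $?;

# Move the result file only if the run succeeded and the file exists,
# which avoids the "mv: cannot stat" errors from the question.
if ($status == 0 && -e 'out.mt0') {
    rename 'out.mt0', 'mont_read.mt0'
        or die "Can't rename out.mt0: $!";
    print "Result saved as mont_read.mt0\n";
}
else {
    warn "$cmd[0] ", describe_status($status), "; skipping this iteration\n";
}
```

Skipping the statistics step when the file is missing would also keep monte_stat.pl from running on empty input, which is a plausible (though unconfirmed) source of the division-by-zero messages.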
Another option is to simply sleep for a $timeout period and then check whether the job is done (the child has exited). However, with your approach you can do other things while polling.
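A minimal sketch of that simpler variant, assuming the parent has nothing else to do in the meantime. The function name wait_then_check and the sleep commands are illustrative only.

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

# Hypothetical helper: start @cmd, sleep for $timeout seconds, then
# check once whether the child has finished. Returns 'done' or
# 'timed out'.
sub wait_then_check {
    my ($timeout, @cmd) = @_;

    my $pid = fork // die "Can't fork: $!";
    if ($pid == 0) {
        exec @cmd or die "Can't exec $cmd[0]: $!";
    }

    sleep $timeout;                      # always waits the full period

    my $ret = waitpid $pid, WNOHANG;     # single non-blocking check
    return 'done' if $ret == $pid;       # child already exited; $? is set

    kill 'TERM', $pid;                   # still running: stop it ...
    waitpid $pid, 0;                     # ... and reap it
    return 'timed out';
}

print wait_then_check(2, 'sleep', '1'), "\n";   # job shorter than timeout
print wait_then_check(1, 'sleep', '5'), "\n";   # job longer than timeout
```

The cost of this simplicity is that a job finishing early still holds the parent for the whole $timeout, which matters when many short iterations are run.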
Another option is to use alarm.
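One way the alarm approach could look is sketched below, using the common eval/alarm idiom around a blocking waitpid. The function name run_with_alarm is hypothetical, and the sleep commands are placeholders for the real job.

```perl
use strict;
use warnings;

# Hypothetical helper: run @cmd in a child and wait for it at most
# $timeout seconds. Returns the child's wait status ($?) if it
# finished in time, or undef if it had to be terminated.
sub run_with_alarm {
    my ($timeout, @cmd) = @_;

    my $pid = fork // die "Can't fork: $!";
    if ($pid == 0) {
        exec @cmd or die "Can't exec $cmd[0]: $!";
    }

    my $timed_out = 0;
    eval {
        local $SIG{ALRM} = sub { die "alarm\n" };
        alarm $timeout;
        waitpid $pid, 0;   # blocks until the child exits or the alarm fires
        alarm 0;           # child finished in time: cancel the alarm
        1;
    } or do {
        die $@ unless $@ eq "alarm\n";   # rethrow unrelated errors
        $timed_out = 1;
        kill 'TERM', $pid;
        waitpid $pid, 0;                 # reap the terminated child
    };

    return $timed_out ? undef : $?;
}

my $status = run_with_alarm(3, 'sleep', '1');
print defined $status ? "Finished with status $status\n" : "Timed out\n";

$status = run_with_alarm(1, 'sleep', '5');
print defined $status ? "Finished with status $status\n" : "Timed out\n";
```

Unlike the polling loop, the parent here returns as soon as the child exits, but it cannot do other work while waiting.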