Out-of-memory error in parfor: kill the slave, not the master
Question
When an out-of-memory error is raised in a parfor, is there any way to kill only one Matlab slave to free some memory, instead of having the entire script terminate?
Here is what happens by default when an out-of-memory error occurs in a parfor: the script terminates, as shown in the screenshot below.
I wish there were a way to kill just one slave (i.e. remove a worker from the parpool), or to stop using it, so that as much memory as possible is released from it.
Recommended answer
If the master process runs out of memory, there is no chance to fix this. For an out-of-memory error on a slave, this should do it:
The idea of the code is simple: restart the parfor again and again with the missing data until you have all results. If one iteration fails, a flag (a file) is written, which makes all remaining iterations throw an error as soon as the first failure has occurred. This way we get "out of the loop" without wasting time producing further out-of-memory errors.
% Your intended iterator
iterator = 1:10;
% Flags which indicate what succeeded
succeeded = false(size(iterator));
% Result array
result = nan(size(iterator));
% Name of the flag file that signals a crash to all workers
FLAG = 'ANY_WORKER_CRASHED';
while ~all(succeeded)
    fprintf('Another try\n')
    % Determine which iterations still have to be done
    todo = iterator(~succeeded);
    % Initialize array for the remaining results
    partresult = nan(size(todo));
    % Initialize flags which indicate which iterations succeeded (we cannot
    % throw errors out of the parfor, that would throw away the results)
    partsucceeded = false(size(todo));
    % The flag indicates that any worker crashed. Have to use a file-based
    % solution, don't know a better one.
    if exist(FLAG, 'file')
        delete(FLAG);
    end
    try
        parfor falseindex = 1:numel(todo)
            realindex = todo(falseindex);
            try
                % The flag is used to let all other workers jump out of the
                % loop as soon as one calculation has crashed.
                if exist(FLAG, 'file')
                    error('some other worker crashed');
                end
                % Insert your code here
                % Dummy code which randomly throws an exception
                if rand < .5
                    error('hit out of memory')
                end
                partresult(falseindex) = realindex*2;
                % End of user code
                partsucceeded(falseindex) = true;
                fprintf('trying to run %d and succeeded\n', realindex)
            catch ME
                % Catch errors within workers to preserve finished work
                partresult(falseindex) = nan;
                partsucceeded(falseindex) = false;
                fprintf('trying to run %d but it failed\n', realindex)
                % Create the (empty) flag file
                fclose(fopen(FLAG, 'w'));
            end
        end
    catch
        % The parfor itself failed: reduce the pool size by 1.
        % The original answer used matlabpool, which has been removed from
        % newer MATLAB releases; gcp/parpool are its replacements.
        pool = gcp('nocreate');
        if ~isempty(pool)
            % Never go below one worker
            newsize = max(pool.NumWorkers - 1, 1);
            delete(pool);
            parpool(newsize);
        end
    end
    % Put the results of the current pass into the full result
    result(~succeeded) = partresult;
    succeeded(~succeeded) = partsucceeded;
end
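One caveat with the retry loop: the inner catch treats every error as a reason to retry, so an ordinary bug in the user code would be retried forever. A minimal sketch of a check that could tell the two cases apart is given below. The helper name isOutOfMemory is hypothetical and not part of the original answer, and it assumes that the out-of-memory error on your MATLAB release carries the identifier 'MATLAB:nomem'; verify this against the error you actually see.

```matlab
function tf = isOutOfMemory(ME)
% ISOUTOFMEMORY  Hypothetical helper, not part of the original answer.
% Returns true if ME is an MException whose identifier is 'MATLAB:nomem'.
% Assumption: that identifier is what your MATLAB release uses when a
% worker runs out of memory; other releases or error paths may differ.
    tf = isa(ME, 'MException') && strcmp(ME.identifier, 'MATLAB:nomem');
end
```

Inside the catch ME block of the parfor, the result of isOutOfMemory(ME) could be used to decide whether an iteration is worth retrying or should be reported as a genuine failure.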