Sending data to workers
Question
I am trying to write parallel code to speed up the processing of a very large array (a couple of hundred million rows). To parallelise this, I chopped my data into 8 pieces (my number of cores) and tried sending each worker 1 piece. Looking at my RAM usage, however, it seems each piece is sent to every worker, effectively multiplying my RAM usage by 8. A minimal working example:
A = 1:16;
for ii = 1:8
data{ii} = A(2*ii-1:2*ii);
end
Now, when I send this data to the workers using parfor, it seems to send the full cell instead of just the desired piece:
output = cell(1,8);
parfor ii = 1:8
output{ii} = data{ii};
end
I actually call a function within the parfor loop, but this illustrates the case. Does MATLAB actually send the full cell data to each worker, and if so, how can I make it send only the desired piece?
Recommended answer
In my personal experience, parfeval is better than parfor with regard to memory usage. In addition, your problem seems to be divisible into finer pieces, so you can use parfeval to submit more, smaller jobs to the MATLAB workers.
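As a rough sketch of what "more, smaller jobs" could look like, the array can be cut into more chunks than there are workers. The variable chunkLen and the small stand-in array A are assumptions made for illustration and are not part of the original answer; the right chunk size depends on the available memory.

```matlab
% Split an array into row chunks of at most chunkLen rows each.
A = (1:16).';            % stand-in for the real, very large array
chunkLen = 3;            % hypothetical chunk length, chosen for illustration
n = size(A, 1);
starts = 1:chunkLen:n;   % first row of every chunk
jobCnt = numel(starts);  % number of jobs that will be submitted
data = cell(jobCnt, 1);
for k = 1:jobCnt
    stop = min(starts(k) + chunkLen - 1, n);  % last chunk may be shorter
    data{k} = A(starts(k):stop, :);
end
```

Each element of data can then be handed to one parfeval call, so a worker only ever holds the chunk it is processing.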
Let's say that you have workerCnt MATLAB workers to which you are going to hand jobCnt jobs. Let data be a cell array of size jobCnt x 1, each element of which is the data input for a function getOutput that does the analysis on the data. The results are then stored in a cell array output of size jobCnt x 1.
In the following code, jobs are assigned in the first for loop and the results are retrieved in the second while loop. The boolean variable doneJobs indicates which jobs are done.
poolObj = parpool(workerCnt);
jobCnt = length(data); % number of jobs
output = cell(jobCnt,1);
for jobNo = 1:jobCnt
future(jobNo) = parfeval(poolObj,@getOutput,...
nargout('getOutput'),data{jobNo});
end
doneJobs = false(jobCnt,1);
while ~all(doneJobs)
[idx,result] = fetchnext(future);
output{idx} = result;
doneJobs(idx) = true;
end
You can also take this approach one step further if you want to save more memory. After fetching the results of a finished job, you can delete the corresponding member of future. The reason is that this object stores all the input and output data of the getOutput function, which is probably going to be huge. But you need to be careful, as deleting members of future shifts the indices of the remaining ones.
The following is the code I wrote for this purpose.
poolObj = parpool(workerCnt);
jobCnt = length(data); % number of jobs
output = cell(jobCnt,1);
for jobNo = 1:jobCnt
future(jobNo) = parfeval(poolObj,@getOutput,...
nargout('getOutput'),data{jobNo});
end
doneJobs = false(jobCnt,1);
while ~all(doneJobs)
[idx,result] = fetchnext(future);
future(idx) = []; % remove the done future object
oldIdx = 0;
% find the index offset and correct index accordingly
while oldIdx ~= idx
doneJobsInIdxRange = sum(doneJobs((oldIdx + 1):idx));
oldIdx = idx;
idx = idx + doneJobsInIdxRange;
end
output{idx} = result;
doneJobs(idx) = true;
end
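A possibly simpler way to deal with the index shift, offered here as an alternative sketch and not as the method used above, is to keep a vector of original job numbers and delete from it in lockstep with future. The simulation below needs no parallel pool: pendingIds stands in for the shrinking future array, and fetchIdx is a made-up sequence of positions such as fetchnext might return. Both names are hypothetical.

```matlab
% Simulate mapping positions in a shrinking array back to original job numbers.
jobCnt = 5;
pendingIds = (1:jobCnt).';    % original job number of each remaining entry
doneOrder = zeros(jobCnt, 1); % original job numbers, in completion order
fetchIdx = [2 1 3 1 1];       % made-up positions in the shrunken array
for k = 1:jobCnt
    idx = fetchIdx(k);               % position within the remaining entries
    doneOrder(k) = pendingIds(idx);  % recover the original job number
    pendingIds(idx) = [];            % delete alongside future(idx) = []
end
```

In the real retrieval loop, the original job number pendingIds(idx) would be used to index output and doneJobs before the entry is deleted, which would take the place of the inner while loop.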