Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers)


Problem description


I'm working on a MATLAB implementation of an adaptive Matrix-Vector Multiplication for very large sparse matrices coming from a particular discretisation of a PDE (with known sparsity structure).

After a lot of pre-processing, I end up with a number of different blocks (greater than, say, 200), for which I want to calculate selected entries.

One of the pre-processing steps is to determine the (number of) entries per block I want to calculate, which gives me an almost perfect measure of the amount of time each block will take (for all intents and purposes the quadrature effort is the same for each entry).

Thanks to https://stackoverflow.com/a/9938666/2965879, I was able to make use of this by ordering the blocks in reverse order, thus goading MATLAB into starting with the biggest ones first.

However, the number of entries differs so wildly from block to block, that directly running parfor is limited severely by the blocks with the largest number of entries, even if they are fed into the loop in reverse.

My solution is to do the biggest blocks serially (but parallelised on the level of entries!), which is fine as long as the overhead per iterand doesn't matter too much, i.e. as long as the blocks don't get too small. The rest of the blocks I then do with parfor. Ideally, I'd let MATLAB decide how to handle this, but since a nested parfor-loop loses its parallelism, this doesn't work. Also, packaging both loops into one is (nigh) impossible.

My question now is about how to best determine this cut-off between the serial and the parallel regime, taking into account the information I have on the number of entries (the shape of the curve of ordered entries may differ for different problems), as well as the number of workers I have available.

So far, I had been working with the 12 workers available under the standard PCT license, but since I've now started working on a cluster, determining this cut-off becomes more and more crucial (for many cores the overhead of the serial loop becomes more and more costly in comparison to the parallel loop, but, similarly, having blocks which hold up the rest is even more costly).

For 12 cores (resp. the configuration of the compute server I was working with), I had figured out a reasonable parameter of 100 entries per worker as a cut-off, but this doesn't work well when the number of cores is no longer small in relation to the number of blocks (e.g. 64 vs. 200).

I've tried to deflate the number of cores with different powers (e.g. 1/2, 3/4), but this also doesn't work consistently. Next I tried to group the blocks into batches and determine the cut-off when entries are larger than the mean per batch, scaled by the number of batches they are away from the end:

logical_sml = true(1,num_core); i = 0;
while all(logical_sml)
    i = i+1;
    m = mean(num_entr_asc(1:min(i*num_core,end))); % "asc" ~ ascending order 
    logical_sml = num_entr_asc(i*num_core+(1:num_core)) < i^(3/4)*m;  
        % if the small blocks were parallelised perfectly, i.e. all  
        % cores take the same time, the time would be proportional to  
        % i*m. To try to discount the different sizes (and imperfect  
        % parallelisation), we only scale with a power of i less than  
        % one to not end up with a few blocks which hold up the rest  
end  
num_block_big = num_block - (i+1)*num_core + sum(~logical_sml);

(Note: This code doesn't work for vectors num_entr_asc whose length is not a multiple of num_core, but I decided to omit the min(...,end) constructions for legibility.)

I've also omitted the < max(...,...) for combining both conditions (i.e. together with minimum entries per worker), which is necessary so that the cut-off isn't found too early. I thought a little about somehow using the variance as well, but so far all attempts have been unsatisfactory.

I'd be very grateful if someone has a good idea for how to solve this.
Thanks for reading this very long question,
Best regards,
Axel

Ps. Since my "Dear stackoverflow," seems to be filtered, let me say thanks for the numerous times I've already found the solution to a question of mine here.

Solution

I came up with a somewhat satisfactory solution, so in case anyone's interested I thought I'd share it. I would still appreciate comments on how to improve/fine-tune the approach.

Basically, I decided that the only sensible way is to build a (very) rudimentary model of the scheduler for the parallel loop:

function c=est_cost_para(cost_blocks,cost_it,num_cores)
% Estimate cost of parallel computation

% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%       consistency with the other code this must be in the reverse order
%       that the scheduler is fed, i.e. cost should be ascending!
%   cost_it:     Base cost of iteration (regardless of number of entries)
%       in the same units as cost_blocks.
%   num_cores:   Number of cores
%
% Output:
%   c: Estimated cost of parallel computation

num_blocks=numel(cost_blocks);
c=zeros(num_cores,1);

i=min(num_blocks,num_cores);
c(1:i)=cost_blocks(end-i+1:end)+cost_it;
while i<num_blocks
    i=i+1;
    [~,i_min]=min(c); % which core finished first; is fed with next block
    c(i_min)=c(i_min)+cost_blocks(end-i+1)+cost_it;
end

c=max(c);

end
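For readers who want to experiment with this model outside MATLAB, the same greedy scheduler can be sketched in a few lines of Python using a min-heap; this is an illustrative re-implementation of the routine above, not part of the original answer:

```python
import heapq

def est_cost_para(cost_blocks, cost_it, num_cores):
    """Greedy scheduler model: the core that finishes first takes the
    next-largest remaining block. cost_blocks must be ascending, matching
    the ordering convention of the MATLAB routine."""
    heap = [0.0] * num_cores          # finish time of each core
    for cost in reversed(cost_blocks):  # feed largest blocks first
        t = heapq.heappop(heap)         # core that becomes free first
        heapq.heappush(heap, t + cost + cost_it)
    return max(heap)                    # makespan = last core to finish
```

For example, with ascending block costs [1, 2, 3, 4], zero per-iteration cost and two cores, the model predicts a makespan of 5: one core receives the blocks of cost 4 and 1, the other those of cost 3 and 2.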

The parameter cost_it for an empty iteration is a crude blend of many different side effects, which could conceivably be separated: the cost of an empty iteration in a for/parfor-loop (which could also differ per block), as well as the start-up time and data transmission of the parfor-loop (and probably more). My main reason to throw everything together is that I don't want to have to estimate/determine the more granular costs.

I use the above routine to determine the cut-off in the following way:

function i=cutoff_ser_para(cost_blocks,cost_it,num_cores)
% Determine cut-off between serial and parallel regime

% Inputs:
%   cost_blocks: Estimate of cost per block in arbitrary units. For
%       consistency with the other code this must be in the reverse order
%       that the scheduler is fed, i.e. cost should be ascending!
%   cost_it:     Base cost of iteration (regardless of number of entries)
%       in the same units as cost_blocks.
%   num_cores:   Number of cores
%
% Output:
%   i: Number of blocks to be calculated serially

num_blocks=numel(cost_blocks);
cost=zeros(num_blocks+1,2);

for i=0:num_blocks
    cost(i+1,1)=sum(cost_blocks(end-i+1:end))/num_cores + i*cost_it;
    cost(i+1,2)=est_cost_para(cost_blocks(1:end-i),cost_it,num_cores);
end

[~,i]=min(sum(cost,2));
i=i-1;

end
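The cut-off search can likewise be sketched in Python (again an illustrative port, with the scheduler model repeated so the snippet is self-contained; as in the MATLAB code, the i largest blocks are assumed to parallelise perfectly over their entries, hence the division by num_cores):

```python
import heapq

def est_cost_para(cost_blocks, cost_it, num_cores):
    # Greedy scheduler model: the core that finishes first takes the
    # next-largest remaining block (cost_blocks is ascending).
    heap = [0.0] * num_cores
    for cost in reversed(cost_blocks):
        heapq.heappush(heap, heapq.heappop(heap) + cost + cost_it)
    return max(heap)

def cutoff_ser_para(cost_blocks, cost_it, num_cores):
    """Try every split: the i largest blocks run serially (but
    entry-parallel, hence /num_cores), the rest go to the parallel loop.
    Returns the i with the smallest total estimated cost."""
    n = len(cost_blocks)
    best_i, best_cost = 0, float("inf")
    for i in range(n + 1):
        serial = sum(cost_blocks[n - i:]) / num_cores + i * cost_it
        parallel = est_cost_para(cost_blocks[:n - i], cost_it, num_cores)
        if serial + parallel < best_cost:
            best_i, best_cost = i, serial + parallel
    return best_i
```

With costs [1, 1, 1, 1, 100] on two cores, for instance, running everything in the parallel loop costs 100 (one core is stuck on the big block), while peeling off just the largest block serially brings the estimate down to 52.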

In particular, I don't inflate/change the value of est_cost_para which assumes (aside from cost_it) the most optimistic scheduling possible. I leave it as is mainly because I don't know what would work best. To be conservative (i.e. avoid feeding too large blocks to the parallel loop), one could of course add some percentage as a buffer or even use a power > 1 to inflate the parallel cost.

Note also that est_cost_para is called with successively fewer blocks (although I use the variable name cost_blocks for both routines, one is a subset of the other).

Compared to the approach in my wordy question I see two main advantages:

  1. The relatively intricate dependence between the data (both the number of blocks as well as their cost) and the number of cores is captured much better with the simulated scheduler than would be possible with a single formula.
  2. By calculating the cost for all possible combinations of serial/parallel distribution and then taking the minimum, one cannot get "stuck" too early while reading in the data from one side (e.g. by a jump which is large relative to the data so far, but small in comparison to the total).

Of course, the asymptotic complexity is higher because est_cost_para, with its while-loop, is called over and over, but in my case (num_blocks < 500) this is absolutely negligible.

Finally, if a decent value of cost_it does not readily present itself, one can try to calculate it by measuring the actual execution time of each block, as well as the purely parallel part of it, and then trying to fit the resulting data to the cost prediction and get an updated value of cost_it for the next call of the routine (by using the difference between total cost and parallel cost or by inserting a cost of zero into the fitted formula). This should hopefully "converge" to the most useful value of cost_it for the problem in question.
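As a hedged illustration of that last idea: assuming the simple linear model time ≈ slope·entries + cost_it, the per-iteration overhead can be recovered as the intercept of a least-squares fit of measured block times against entry counts. The function name and the linear model are my own choices for the sketch, not prescribed by the answer above:

```python
def fit_cost_it(num_entries, total_times):
    """Least-squares fit of time = slope*entries + intercept; the
    intercept (the predicted cost at zero entries) serves as an
    updated estimate of cost_it for the next call of the routine."""
    n = len(num_entries)
    mean_x = sum(num_entries) / n
    mean_y = sum(total_times) / n
    sxx = sum((x - mean_x) ** 2 for x in num_entries)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(num_entries, total_times))
    slope = sxy / sxx
    return mean_y - slope * mean_x   # intercept ~ cost_it
```

Re-fitting after each run and feeding the intercept back in is one concrete way to make the "convergence" to a useful cost_it described above operational.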
