无法向量化计算时,如何加快Matlab嵌套循环? [英] how to speed up Matlab nested for loops when I cannot vectorize the calculations?

查看:112
本文介绍了无法向量化计算时,如何加快Matlab嵌套循环?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有三个大小相同的大型3D数组[41 * 141 * 12403],在Matlab代码中分别以alpha,beta和ni命名.从它们中,我需要计算另一个具有相同大小的3D数组,该数组是通过使用每个元素的值将无穷大和定积分计算相结合的计算从原始矩阵逐元素获得的.因此,似乎不可避免地必须使用多个嵌套循环来进行此计算.该代码现在已经运行了几个小时(!),并且仍处于外循环的第一次迭代中(该循环需要执行41次!!根据我的计算,这样,程序将必须运行超过两年!!!).我不知道如何优化代码.请帮我 !!

I have three big 3D arrays of the same size [41*141*12403], named in the Matlab code below alpha, beta and ni. From them I need to calculate another 3D array with the same size, which is obtained elementwise from the original matrices through a calculation that combines an infinite sum and definite integral calculations, using the value of each element. It therefore seems inevitible to have to use several nested loops to make this calculation. The code is already running now for several hours(!) and it is still in the first iteration of the outer loop (which needs to be performed 41 times!! According to my calculation, in this way the program will have to run more than two years!!!). I don't know how to optimize the code. Please help me !!

我使用的代码:

    z_len=size(KELDYSH_PARAM_r_z_t,1);   % 41 rows
    r_len=size(KELDYSH_PARAM_r_z_t,2);   % 141 columns   
    t_len=size(KELDYSH_PARAM_r_z_t,3);   % 12403 slices

    sumRes=zeros(z_len,r_len,t_len);

    for z_ind=1:z_len
        z_ind     % in order to track the advancement of the calculation
        for r_ind=1:r_len
            for t_ind=1:t_len
                sumCurrent=0;
                sumPrevious=inf;
                s=0;

                while abs(sumPrevious-sumCurrent)>1e-6
                    kapa=kapa_0+s;    %some scalar
                    x_of_w=(beta(z_ind,r_ind,t_ind).*(kapa-ni...
                       (z_ind,r_ind,t_ind))).^0.5;               
                    sumPrevious=sumCurrent;
                    sumCurrent=sumCurrent+exp(-alpha(z_ind,r_ind,t_ind).* ...
                        (kapa-ni(z_ind,r_ind,t_ind))).*(x_of_w.^(2*abs(m)+1)/2).* ...
                            w_m_integral(x_of_w,m);
                    s=s+1;
                end

                sumRes(z_ind,r_ind,t_ind)=sumCurrent;
            end
        end
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function  res=w_m_integral(x_of_w,m)

    res=quad(@integrandFun,0,1,1e-6);

    function y=integrandFun(t)
            y=exp(-x_of_w^2*t).*t.^(abs(m))./((1-t).^0.5);
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

推荐答案

选项1-更多向量化

这是一个您正在使用的非常复杂的模型,虽然并未解释所有术语,但仍可以对某些部分进行进一步的矢量化处理.您的alphabetani矩阵大概是静态的并且是预先计算的?您的s值是一个标量,而kapa可能是一个,因此您也可以一次性计算x_of_w矩阵.尽管您会花很多时间来获得内存,但是这会给您一个很小的加速效果-这些天可以达到7,100万点,但需要大量的硬件.对您的41行中的每一行都执行一次,这样可以减少负担.

It's a pretty complex model you're working with and not all the terms are explained, but some parts can still be further vectorised. Your alpha, beta and ni matrices are presumably static and precomputed? Your s value is a scalar and kapa could be either, so you can probably precompute the x_of_w matrix all in one go too. This would give you a very slight speedup all on its own, though you'd be spending memory to get it - 71 million points is doable these days but will call for an awful lot of hardware. Doing it once for each of your 41 rows would reduce the burden neatly.

剩下积分本身. quad函数不接受向量输入-这将是一场噩梦,不是吗? -而MathWorks建议您使用的integral都不会.但是,如果每种情况下您的积分限制都相同,那么为什么不采用老式的积分方式呢?计算一个积分为1的矩阵,计算另一个积分为0的矩阵,然后求和.

That leaves the integral itself. The quad function doesn't accept vector inputs - it would be a nightmare wouldn't it? - and neither does integral, which Mathworks are recommending you use instead. But if your integration limits are the same in each case then why not do the integral the old-fashioned way? Compute a matrix for the value of the integrand at 1, compute another matrix for the value of the integrand at 0 and then take the difference.

然后,您可以编写一个循环,计算整个输入空间的积分,然后测试所有矩阵元素的收敛性.制作一个遮罩以记录尚未收敛的蒙版,然后重新计算s增大的蒙版.重复进行,直到全部收敛(或者您达到迭代的阈值).

Then you can write a single loop that computes the integral for the whole input space then tests the convergence for all the matrix elements. Make a mask that notes the ones that have not converged and recalculate those with the increased s. Repeat until all have converged (or you hit a threshold for iterations).

选项2-并行化

过去,矢量化操作比循环要快得多.我现在找不到它的来源,但我想我已经读到它最近也通过for循环变得越来越快,因此根据可用资源的不同,通过并行化当前代码可能会获得更好的结果有.这也需要一点重构-最大的问题是将数据复制到工作程序时的开销(您可以通过将输入切成小块并仅将相关的数据输入来解决),并且parfor循环不允许您可以使用某些变量,通常是覆盖整个空间的变量.再次将它们切碎会有所帮助.

It used to be the case that matlab was much faster with vectorised operations than loops. I can't find a source for it now but I think I've read that it's become a lot faster recently with for loops too, so depending on the resources you have available you might get better results by parallelising the code you currently have. That's going to need a bit of refactoring too - the big problems are overheads while copying in data to the workers (which you can fix by chopping the inputs up into chunks and just feeding the relevant one in) and the parfor loop not allowing you to use certain variables, usually ones which cover the whole space. Again chopping them up helps.

但是,如果您有2年的运行时间,则我估计至少需要100倍,这意味着群集!如果您在大学或某个可以在500核集群上工作几天的地方,那么就去做吧……

But if you have a 2 year runtime you will need a factor of at least 100 I'm guessing, so that means a cluster! If you're at a university or somewhere where you might be able to get a few days on a 500-core cluster then go for that...

如果您可以用封闭形式编写积分,则可能适合GPU计算.这些事情可以非常快地完成某些类别的计算,但是您必须能够并行化工作,并将实际计算减少为主要由加法和乘法组成的基本运算. CUDA库已经完成了很多工作,并且 matlab有一个接口因此,请阅读这些内容.

If you can write the integral in a closed form then it might be amenable to GPU computation. Those things can do certain classes of computation very fast but you have to be able to parallelise the job and reduce the actual computation to something basic comprised mainly of addition and multiplication. The CUDA libraries have done a lot of the legwork and matlab has an interface to them so have a read about those.

选项3-缩小范围

最后,如果以上两种方法均未导致足够的加速,那么您可能必须缩小计算范围.尽可能多地调整输入空间,并接受较低的收敛阈值.如果您知道最里面的while循环(其中有s计数器的循环)内通常需要进行多少次迭代,那么可能会发现减少收敛准则会减少所需的迭代次数,这可能会加快速度起来吧.探查器可以帮助您了解您在哪里花费时间.

Finally, if neither of the above two results in sufficient speedups, then you may have to reduce the scope of your calculation. Trim the input space as much as you can and perhaps accept a lower convergence threshold. If you know how many iterations you tend to need inside the innermost while loop (the one with the s counter in it) then it might turn out that reducing the convergence criterion reduces the number of iterations you need, which could speed it up. The profiler can help see where you're spending your time.

最重要的是,有7100万个点需要花费一些时间来计算.到目前为止,您只能优化计算,很可能对于这种大小的问题,您将不得不对其进行硬件投掷.

The bottom line though is that 71 million points are going to take some time to compute. You can optimise the computation only so far, the odds are that for a problem of this size you will have to throw hardware at it.

这篇关于无法向量化计算时,如何加快Matlab嵌套循环?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆