Optimizing the value N used to split arrays for a vectorized computation so it runs the quickest

Question

I'm trying to optimize the value N used to split arrays into chunks for a vectorized computation, so that it runs as fast as possible on different machines. My test code is below.

% example using random values
clear all;
t = rand(1, 556790);
inner_freq = rand(8193, 6);

N = 100; % use N chunks
nn = int32(linspace(1, length(t)+1, N+1)); % chunk boundaries
aa_sig_combined = zeros(size(t));
total_time_so_far = 0;
for ii = 1:N
    tic;
    ind = nn(ii):nn(ii+1)-1; % indices of the current chunk
    aa_sig_combined(ind) = sum(diag(inner_freq(1:end-1,2)) * cos(2 .* pi .* inner_freq(1:end-1,1) * t(ind)) + repmat(inner_freq(1:end-1,3),[1 length(ind)]));
    elapsed = toc % time for this chunk, printed
    total_time_so_far = total_time_so_far + elapsed
end
fprintf('- Complete  test in %4.4fsec or %4.4fmins\n',total_time_so_far,total_time_so_far/60);

With N = 100 this takes 162.7963 sec (2.7133 mins) to complete on a 16 GB i7 machine running Ubuntu.

Is there a way to find out what value N should be to get this to run the fastest on different machines?

PS: I'm running Octave 3.8.1 on a 16 GB i7 machine with Ubuntu 14.04, but the code will also run on a Raspberry Pi 2 with only 1 GB of RAM.

Recommended answer

This is the Matlab test script that I used to time each step. The return breaks out of the loop after the first iteration, since the remaining iterations appear to behave similarly.

%example use random values
clear all;
t=rand(1,556790);
inner_freq=rand(8193,6);

N=100; % use N chunks
nn = int32( linspace(1, length(t)+1, N+1) );
aa_sig_combined=zeros(size(t));

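A = inner_freq(1:end-1,1);        % frequency column, constant across iterations
D = diag(inner_freq(1:end-1,2));  % diagonal matrix built once, outside the loop
for ii=1:N
    ind = nn(ii):nn(ii+1)-1;
    tic;
    cosPara = 2 * pi * A * t(ind);
    toc;    % elapsed after building the cos argument
    cosResult = cos( cosPara );
    sumParaA = D * cosResult;
    toc;    % elapsed after cos and the multiplication by D
    sumParaB = repmat(inner_freq(1:end-1,3),[1 length(ind)]);
    toc;    % elapsed after repmat
    aa_sig_combined(ind) = sum( sumParaA + sumParaB );
    toc;    % elapsed after the final sum
    return; % stop after the first iteration
end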

The output is shown below. Each value is the time elapsed since tic, so the readings are cumulative. Note that I have a slow computer.

Elapsed time is 0.156621 seconds.
Elapsed time is 17.384735 seconds.
Elapsed time is 17.922553 seconds.
Elapsed time is 18.452994 seconds.

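As you can see, the step containing the cos operation is what takes so long (that interval also includes the multiplication by D). You are running cos on an 8192x5568 matrix (45,613,056 elements), so it makes sense that it takes this long.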

If you wish to improve performance, use parfor, as each iteration appears to be independent. Assuming you had 100 cores to run your 100 iterations, your script would finish in about 17 seconds plus the parfor overhead.
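The parfor suggestion can be sketched as follows. This is a minimal sketch under assumptions, not a tuned implementation: the array sizes are reduced so it runs quickly, the names lo, hi, B and chunks are introduced here for illustration, and each chunk is written to a cell array because parfor cannot assign to an arbitrary index range of one shared output. In MATLAB, parfor needs the Parallel Computing Toolbox; Octave accepts the keyword but, as far as I know, runs the loop serially.

```octave
% Reduced-size stand-ins for the arrays in the question
t = rand(1, 1000);
inner_freq = rand(65, 6);
N = 10;                                      % number of chunks

nn = int32(linspace(1, length(t)+1, N+1));   % chunk boundaries
lo = nn(1:end-1);                            % first index of each chunk
hi = nn(2:end) - 1;                          % last index of each chunk

A = inner_freq(1:end-1, 1);                  % frequencies
D = diag(inner_freq(1:end-1, 2));            % amplitudes, built once
B = inner_freq(1:end-1, 3);                  % offsets

chunks = cell(1, N);                         % one output slot per iteration
parfor ii = 1:N
    ind = lo(ii):hi(ii);
    chunks{ii} = sum(D * cos(2 * pi * A * t(ind)) + repmat(B, [1 length(ind)]));
end
aa_sig_combined = [chunks{:}];               % reassemble in chunk order
```

Whether this pays off depends on the number of cores and on memory, since every worker holds its own chunk-sized temporaries.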

For the cos calculation itself, you might want to check whether another method exists that computes the cosine faster and with more parallelism than the stock function.

Another minor optimization is the line below. It ensures that the diag function isn't called inside the loop, since the diagonal matrix is constant. You don't want an 8192x8192 diagonal matrix to be generated on every iteration! I stored it outside the loop, which gives a small performance boost as well.

D = diag(inner_freq(1:end-1,2));
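Going one step further than hoisting diag out of the loop, the diagonal matrix and the repmat can be avoided altogether, because the column sum of D * C plus a replicated offset column equals a weighted row-vector product plus a constant. This is a swapped-in algebraic rewrite, not part of the original script; the names w, b, ref and fast are introduced here, and the sizes are reduced so the check runs quickly.

```octave
% Reduced-size stand-ins for the arrays in the question
t = rand(1, 1000);
inner_freq = rand(65, 6);

A = inner_freq(1:end-1, 1);   % frequencies (column)
w = inner_freq(1:end-1, 2);   % amplitudes  (column, the diagonal of D)
b = inner_freq(1:end-1, 3);   % offsets     (column)

% Form used in the question: diag, repmat, then a column sum
ref = sum(diag(w) * cos(2 * pi * A * t) + repmat(b, [1 length(t)]));

% Equivalent form: sum_i w(i)*C(i,j) is the product w.' * C,
% and the replicated offsets contribute the same constant sum(b) per column
fast = w.' * cos(2 * pi * A * t) + sum(b);

max_diff = max(abs(ref - fast));   % should be at floating-point noise level
```

The cos call still dominates, but this form skips building the 8192x8192 matrix and the chunk-sized repmat temporary, which may matter on a machine with little RAM.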

Note that I didn't use the Matlab profiler as it didn't work for me, but you should use it in the future for code that is organized into functions.
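The same tic/toc timing can also be pointed at the original question of which N to use: run the chunked loop for a few candidate values on the target machine and keep the fastest. The sketch below is only an illustration under assumptions: the candidate list, the reduced array sizes, and the names candidates, elapsed and best_N are all hypothetical, and a single timing run per candidate is noisy.

```octave
% Reduced-size stand-ins for the arrays in the question
t = rand(1, 10000);
inner_freq = rand(129, 6);
A = inner_freq(1:end-1, 1);
D = diag(inner_freq(1:end-1, 2));
b = inner_freq(1:end-1, 3);

candidates = [1 2 5 10 20 50];          % hypothetical values of N to try
elapsed = zeros(size(candidates));

for k = 1:numel(candidates)
    N = candidates(k);
    nn = int32(linspace(1, length(t)+1, N+1));   % chunk boundaries
    out = zeros(size(t));
    tic;
    for ii = 1:N
        ind = nn(ii):nn(ii+1)-1;
        out(ind) = sum(D * cos(2 * pi * A * t(ind)) + repmat(b, [1 length(ind)]));
    end
    elapsed(k) = toc;                   % total time for this N
end

[best_time, best] = min(elapsed);
best_N = candidates(best);
fprintf('Fastest N = %d (%.4f sec)\n', best_N, best_time);
```

On a memory-constrained machine such as the Raspberry Pi, the smallest candidates may not fit in RAM at full problem size, so the candidate list would need to start at a larger N there.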
