MATLAB中元素访问的时间复杂度 [英] Time complexity of element accessing in MATLAB

查看:342
本文介绍了MATLAB中元素访问的时间复杂度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下两个代码段执行相同的任务(从N维球体均匀生成M个样本).我想知道为什么后一个比前一个消耗更多的时间.

The following two code snippets perform the same task (generating M samples uniformly from an N-dim sphere). I was wondering why the latter one consumes much more time than the previous one.

%% MATLAB R2014a    
M = 30;
N = 10000;    

#1
tic
S = zeros(M, N);
for k = 1:M
    P = ones(1, N);
    for i = 1:N - 1
        t = rand*2*pi;
        P(1:i) = P(1:i)*sin(t);
        P(i+1) = P(i+1)*cos(t);
    end
    S(k,:) = P;
end
toc

#2
tic
S = ones(M, N);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(k, 1:i) = S(k, 1:i)*sin(t);
        S(k, i+1) = S(k, i+1)*cos(t);
    end
end
toc

输出为:

Elapsed time is 15.007667 seconds.
Elapsed time is 59.745311 seconds.

我也尝试过M = 1,

And I also tried M = 1,

Elapsed time is 0.463370 seconds.
Elapsed time is 1.566913 seconds.

#2比#1慢近4倍.在#2中频繁访问2d元素是否耗时?

#2 is nearly 4 times slower than #1. Is frequent 2d element accessing in #2 making it time-consuming?

推荐答案

时间差异归因于内存访问模式以及它们映射到缓存的程度.也可能是MATLAB对您的硬件矢量单元(SSE/AVX)的利用. MATLAB存储矩阵"column-major",表示S(2,1)S(1,1)旁边.

The time difference is due to memory access patterns, and how well they map onto the cache. And also possibly to MATLAB's exploitation of your hardware vector unit (SSE/AVX). MATLAB stores matrices "column-major", meaning S(2,1) is next to S(1,1).

在#1中,您使用存在于连续内存中的向量P处理每个样本.这些80,000字节很容易放入L2高速缓存中,以实现您需要执行的快速重复访问.它们也是邻居,并且被简单地矢量化了(我不确定MATLAB是否执行了此优化,但我希望如此...)

In #1, you process each sample using the vector P, which lives in contiguous memory. These 80,000 bytes fit easily in L2 cache for the fast repeated access you need to perform. They're also neighbors, and trivially vectorized (I'm not certain if MATLAB performs this optimization, but I'd hope so...)

在#2中,您一次访问一行S,这不是连续的,而是与M值交错.因此,每一行分布在30 * 80,000字节中,这不适用于L2高速缓存.即使您忽略该数据中的29/30个值,也必须为每次重复访问重新读取它.

In #2, you access a row of S at a time, which is not contiguous, but rather is interleaved by M values. So each row is spread across 30*80,000 bytes, which does not fit in L2 cache. It'll have to be read back in for each repeated access, even though you're ignoring 29/30 values in that data.

这是测试.我正在做的所有工作都是换位S,以便您一次可以处理一列,然后将其放回末尾以得到相同的结果:

Here's the test. All I'm doing it transposing S so that you can process a column at a time instead, then putting it back at the end just to get the same result:

#3
tic
S = ones(N, M);
for k = 1:M
    for i = 1:N - 1
        t = rand*2*pi;
        S(1:i, k) = S(1:i, k)*sin(t);
        S(i+1, k) = S(i+1, k)*cos(t);
    end
end
S = S.';
toc

结果:

Elapsed time is 11.254212 seconds.
Elapsed time is 45.847750 seconds.
Elapsed time is 11.501580 seconds.

是的,转置S与分离向量方法具有相同的连续访问和性能.顺便说一句,L3与L2的时钟周期大约增加了4倍... 1

Yep, transposing S gets us the same contiguous access and performance as the separate vector approach. By the way, L3 vs. L2 is about 4x more clock cycles... 1

让我们看看是否可以找到与缓存大小有关的任何断点.这是N = 1000,其中所有内容都应适合L2:

Let's see if we can find any breakpoints related to cache size. Here's N = 1000, where everything should fit in L2:

Elapsed time is 0.240184 seconds.
Elapsed time is 0.373448 seconds.
Elapsed time is 0.258566 seconds.

差异要小得多,尽管现在我们可能会进入L1效果.

Much lower difference, though now we're probably into L1 effects.

最后,这是解决问题的一种完全不同的方法.它依赖于多元正态RV具有正确的对称性这一事实.

Finally, here's a completely different way to solve your problem. It relies on the fact that multivariate normal RV's have the correct symmetry.

#4
tic
S = randn(M, N);
S = bsxfun(@rdivide, S, sqrt(sum(S.*S, 2)));
toc

Elapsed time is 10.714104 seconds.
Elapsed time is 45.351277 seconds.
Elapsed time is 11.031061 seconds.
Elapsed time is 0.015068 seconds.

这篇关于MATLAB中元素访问的时间复杂度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆