Loop tiling: how to choose the block size?


Problem description


I am trying to learn about loop optimization. I found that loop tiling helps make array loops faster, so I tried the two blocks of code below, with and without loop blocking, and measured the time taken by each. Most of the time I did not find a significant difference. I tested varying the block size, but I am not sure how to choose it; please tell me if my approach is wrong. In fact, the loop without blocking often runs faster.

A. With blocking

int max = 1000000;
int B = 100;
for (i = 0; i < max; i += B)
{
    for (j = i; j < (i + B); j++)
    {
        array[j] = 0;
    }
}

B. Without blocking

for (i = 0; i < max; i++)
{
    array[i] = 0;
}

Time taken:

With blocking: elapsed time - 6997000 nano secs

Without blocking: elapsed time - 6097000 nano secs

Recommended answer


As pointed out here, tiling is a technique meant to keep your working set inside the caches while you work on it, so that repeated accesses enjoy cache latency rather than memory latency. If you only stream over your data once you will see no benefit, since you never hit the cache again, so tiling won't be useful.
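To make that concrete, here is a minimal sketch (not part of the original answer; N and BLK are illustrative values) of a case where tiling does pay off: a matrix transpose, where the strided writes to the output matrix only reuse cache lines if the loops are blocked.

#include <stdio.h>

#define N   2048
#define BLK 32      /* tile edge: two 32x32 tiles of doubles are 16 KB, well within a typical L1D */

static double A[N][N], T[N][N];

/* Naive transpose: the writes to T stride by N*8 bytes, so a cache line of T
   is evicted long before its next element is written. */
static void transpose_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            T[j][i] = A[i][j];
}

/* Tiled transpose: the working set is one BLK x BLK tile of A plus one of T,
   so the strided writes now hit lines that are still resident in the cache. */
static void transpose_tiled(void)
{
    for (int ii = 0; ii < N; ii += BLK)
        for (int jj = 0; jj < N; jj += BLK)
            for (int i = ii; i < ii + BLK; i++)
                for (int j = jj; j < jj + BLK; j++)
                    T[j][i] = A[i][j];
}

int main(void)
{
    transpose_naive();
    transpose_tiled();
    printf("%f\n", T[1][2]);   /* keep the work from being optimized away */
    return 0;
}

In the question's loops, by contrast, each element is written exactly once, so there is no reuse for tiling to exploit.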


Your example loops do exactly the same work in the same order, except that the tiled version adds another branch and makes the branch pattern slightly more complicated (most predictors will cope with that; it just doesn't help in any way).


Consider the following example -

#include "stdlib.h"
#include "stdio.h"
#include <time.h>

#define MAX (1024*1024*32)
#define REP 100
#define B  (16*1024)
int main() { 
    int i,j,r;
    char array[MAX];

    for (i = 0; i < MAX; i++) {  // warmup to make things equal if array happens to fit in your L3
       array[i] = 0;
    }

    clock_t t1 = clock();

    // Tiled loop
    for (i = 0; i < MAX; i += B) {
        for (r = 0; r < REP; r++) {
             for (j = i; j < (i + B); j+=64) {
                 array[j] = r;
             }
        }
    }
    clock_t t2 = clock();

    // un-tiled loop
    for (r = 0; r < REP; r++) {
        for (i = 0; i < MAX; i+=64) {
             array[i] = r;
        }
    }

    clock_t t3 = clock();
    printf ("Tiled: %f sec\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf ("Untiled: %f sec\n", (double)(t3 - t2) / CLOCKS_PER_SEC);
    printf ("array[0] = %d\n", array[0]);    // to prevent optimizing out all the writes
}


Both loops are doing the same accesses (the 64-byte jumps are there to stress the caches by touching each cache line only once, and to prevent IPC and instruction dispatch from becoming the bottleneck).


The tiled version reorders these accesses into blocks, so that repeating a single block can hit the cache over and over. Since the block size is set to 16 KB it will probably fit in most L1 data caches and get a really good latency. For each iteration of the outer loop you get one pass where you miss all the caches and go to memory (if your L3 is bigger than 32 MB, just dial MAX even higher to make sure), and REP-1 passes that are served from the L1.
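As a rule of thumb for choosing B: pick it so that one block, plus whatever else the loop touches, fits in the cache level you are targeting; the 16 KB block above is about half of a typical 32 KB L1 data cache, which matches that rule. Below is a minimal sketch (not from the original answer) assuming Linux/glibc, where sysconf() exposes the cache geometry; the fallback values are assumptions:

#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* glibc extensions; sysconf() may return -1 or 0 if the kernel does not report them */
    long l1d  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    if (l1d <= 0)  l1d  = 32 * 1024;    /* assumed fallback: 32 KB L1D */
    if (line <= 0) line = 64;           /* assumed fallback: 64-byte lines */

    /* leave about half of L1 for everything else the loop touches,
       rounded down to a whole number of cache lines */
    long block = (l1d / 2) / line * line;

    printf("L1D = %ld B, line = %ld B, suggested block = %ld B\n", l1d, line, block);
    return 0;
}

If the inner loop touches several arrays at once, divide that budget between them.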


The untiled version also repeats itself REP times in total, but every repetition thrashes all the data out of the caches, so every access goes to memory, accumulating into a much higher overall latency.


Compiling with gcc 4.8.1 (-O3) on a Xeon 5670 @ 2.9GHz gives me -

Tiled: 0.110000 sec
Untiled: 0.450000 sec
array[0] = 99

A 4x speedup :)


Note that the untiled version still has one advantage: it is a single ordered stream, so a HW prefetcher can run ahead and fetch data for you effectively, somewhat mitigating the memory latency. However, you can help the CPU do something similar in the blocked version if you add the following:

for (i = 0; i < MAX; i += B) {
    for (r = 0; r < REP; r++) {
         for (j = i; j < (i + B); j+=64) {
             array[j] = r;
             if (r == REP - 2)                         // SW prefetching
                 __builtin_prefetch(&array[j+B], 0);    
         }
    }
}


This tells the CPU to bring in the next block slightly before it finishes the current one. For the price of a conditional branch (with a few mispredictions per block), you save the execution time of the first pass over the next block - that gets me a further reduction, down to:

Tiled: 0.070000 sec
