Effects of loop unrolling on memory bound data

Problem description

I have been working with a piece of code that is intensively memory bound. I am trying to optimize it within a single core by manually implementing cache blocking, software prefetching, loop unrolling, etc. Cache blocking gives a significant improvement in performance; however, when I introduce loop unrolling I get tremendous performance degradation.

I am compiling with Intel icc, with the compiler flags -O2 and -ipo, in all my test cases.

My code is similar to this (a 3D 25-point stencil):

    void stencil_baseline(double *V, double *U, int dx, int dy, int dz,
                          double c0, double c1, double c2, double c3, double c4)
    {
        int i, j, k;

        for (k = 4; k < dz - 4; k++) {
            for (j = 4; j < dy - 4; j++) {
                // x-direction
                for (i = 4; i < dx - 4; i++) {
                    U[k*dy*dx + j*dx + i] = c0 *  V[k*dy*dx + j*dx + i]  // center
                        + c1 * (V[k*dy*dx + j*dx + (i-1)] + V[k*dy*dx + j*dx + (i+1)])
                        + c2 * (V[k*dy*dx + j*dx + (i-2)] + V[k*dy*dx + j*dx + (i+2)])
                        + c3 * (V[k*dy*dx + j*dx + (i-3)] + V[k*dy*dx + j*dx + (i+3)])
                        + c4 * (V[k*dy*dx + j*dx + (i-4)] + V[k*dy*dx + j*dx + (i+4)]);
                }

                // y-direction
                for (i = 4; i < dx - 4; i++) {
                    U[k*dy*dx + j*dx + i] += c1 * (V[k*dy*dx + (j-1)*dx + i] + V[k*dy*dx + (j+1)*dx + i])
                        + c2 * (V[k*dy*dx + (j-2)*dx + i] + V[k*dy*dx + (j+2)*dx + i])
                        + c3 * (V[k*dy*dx + (j-3)*dx + i] + V[k*dy*dx + (j+3)*dx + i])
                        + c4 * (V[k*dy*dx + (j-4)*dx + i] + V[k*dy*dx + (j+4)*dx + i]);
                }

                // z-direction
                for (i = 4; i < dx - 4; i++) {
                    U[k*dy*dx + j*dx + i] += c1 * (V[(k-1)*dy*dx + j*dx + i] + V[(k+1)*dy*dx + j*dx + i])
                        + c2 * (V[(k-2)*dy*dx + j*dx + i] + V[(k+2)*dy*dx + j*dx + i])
                        + c3 * (V[(k-3)*dy*dx + j*dx + i] + V[(k+3)*dy*dx + j*dx + i])
                        + c4 * (V[(k-4)*dy*dx + j*dx + i] + V[(k+4)*dy*dx + j*dx + i]);
                }
            }
        }
    }
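For reference, the cache blocking mentioned above can be sketched on the x-direction pass alone by tiling the j loop. This is a hypothetical reconstruction, not my exact blocked code; BJ and the x_pass_* names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

#define BJ 8   /* illustrative tile size; tune to the target cache */

/* Untiled x-direction pass (same arithmetic as the stencil above). */
static void x_pass(double *U, const double *V, int dx, int dy, int dz,
                   double c0, double c1, double c2, double c3, double c4)
{
    for (int k = 4; k < dz - 4; k++)
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p = k * dy * dx + j * dx + i;
                U[p] = c0 *  V[p]                 /* center */
                     + c1 * (V[p - 1] + V[p + 1])
                     + c2 * (V[p - 2] + V[p + 2])
                     + c3 * (V[p - 3] + V[p + 3])
                     + c4 * (V[p - 4] + V[p + 4]);
            }
}

/* Sketch of cache blocking: the j loop is tiled so each block of rows
 * can stay resident in cache while k sweeps over it. */
static void x_pass_blocked(double *U, const double *V, int dx, int dy,
                           int dz, double c0, double c1, double c2,
                           double c3, double c4)
{
    for (int jj = 4; jj < dy - 4; jj += BJ) {
        int jend = (jj + BJ < dy - 4) ? jj + BJ : dy - 4;  /* tile end */
        for (int k = 4; k < dz - 4; k++)
            for (int j = jj; j < jend; j++)
                for (int i = 4; i < dx - 4; i++) {
                    int p = k * dy * dx + j * dx + i;
                    U[p] = c0 *  V[p]
                         + c1 * (V[p - 1] + V[p + 1])
                         + c2 * (V[p - 2] + V[p + 2])
                         + c3 * (V[p - 3] + V[p + 3])
                         + c4 * (V[p - 4] + V[p + 4]);
                }
    }
}
```

Since each output element is written exactly once in either order, the tiled pass computes bit-identical results to the untiled one.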

When I do loop unrolling on the innermost loop (dimension i) and unroll the x, y, and z passes separately by unroll factors 2, 4, and 8, I get performance degradation in all 9 cases: unroll by 2 on direction x, unroll by 2 on direction y, unroll by 2 on direction z, unroll by 4 on direction x, etc. But when I perform loop unrolling on the outermost loop (dimension k) by a factor of 8 (and also 2 and 4), I get a very good performance improvement, even better than cache blocking.
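The k-unrolling above, applied to the z-direction pass alone, looks roughly like this (a sketch with illustrative z_pass_* names, not my exact code). Note that the two updates per trip read overlapping V planes, k-3 through k+4, which may be part of why outer unrolling helps:

```c
#include <assert.h>
#include <stdlib.h>

/* Baseline z-direction pass (same arithmetic as the stencil above). */
static void z_pass(double *U, const double *V, int dx, int dy, int dz,
                   double c1, double c2, double c3, double c4)
{
    int s = dy * dx;                          /* stride of one k plane */
    for (int k = 4; k < dz - 4; k++)
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p = k * s + j * dx + i;
                U[p] += c1 * (V[p - s]   + V[p + s])
                      + c2 * (V[p - 2*s] + V[p + 2*s])
                      + c3 * (V[p - 3*s] + V[p + 3*s])
                      + c4 * (V[p - 4*s] + V[p + 4*s]);
            }
}

/* Sketch: the same pass with the outer k loop unrolled by 2.  The two
 * updates per trip share the V planes k-3 .. k+4. */
static void z_pass_unroll_k2(double *U, const double *V, int dx, int dy,
                             int dz, double c1, double c2, double c3,
                             double c4)
{
    int s = dy * dx;
    int k = 4;
    for (; k + 1 < dz - 4; k += 2)            /* two k planes per trip */
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p0 = k * s + j * dx + i;  /* plane k   */
                int p1 = p0 + s;              /* plane k+1 */
                U[p0] += c1 * (V[p0 - s]   + V[p0 + s])
                       + c2 * (V[p0 - 2*s] + V[p0 + 2*s])
                       + c3 * (V[p0 - 3*s] + V[p0 + 3*s])
                       + c4 * (V[p0 - 4*s] + V[p0 + 4*s]);
                U[p1] += c1 * (V[p1 - s]   + V[p1 + s])
                       + c2 * (V[p1 - 2*s] + V[p1 + 2*s])
                       + c3 * (V[p1 - 3*s] + V[p1 + 3*s])
                       + c4 * (V[p1 - 4*s] + V[p1 + 4*s]);
            }
    for (; k < dz - 4; k++)                   /* remainder plane */
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p = k * s + j * dx + i;
                U[p] += c1 * (V[p - s]   + V[p + s])
                      + c2 * (V[p - 2*s] + V[p + 2*s])
                      + c3 * (V[p - 3*s] + V[p + 3*s])
                      + c4 * (V[p - 4*s] + V[p + 4*s]);
            }
}
```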

I even tried profiling my code with Intel VTune. The bottlenecks were mainly due to 1. LLC Misses and 2. LLC Load Misses serviced by Remote DRAM.

I am unable to understand why unrolling the innermost, fastest loop gives a performance degradation, whereas unrolling the outermost, slowest dimension yields a performance improvement. Note that the improvement in the latter case occurs when I compile with icc using -O2 and -ipo.

I am not sure how to interpret these statistics. Can someone help shed some light on this?

Recommended answer

This strongly suggests that the unrolling is causing instruction cache misses, which is typical. In the age of modern hardware, unrolling no longer automatically means faster code. If each inner loop fits in a cache line, you will get better performance.

You may be able to unroll manually, to limit the size of the generated code, but this will require examining the generated machine-language instructions -- and their position -- to ensure that your loop is within a single cache line. Cache lines are typically 64 bytes long, and aligned on 64-byte boundaries.

Outer loops do not have the same effect. They will likely be outside of the instruction cache regardless of the unroll level. Unrolling these results in fewer branches, which is why you get better performance.

"Load misses serviced by remote DRAM" means that you allocated memory on one NUMA node, but now you are running on the other. Setting process or thread affinity based on NUMA is the answer.
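On Linux, one way to do this is a sketch like the following (pin_to_first_allowed_cpu is an illustrative helper, not part of the answer); the same effect is available without code changes by running under numactl --cpunodebind=0 --membind=0:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Sketch: bind the calling thread to the first CPU it is currently
 * allowed to run on.  Combined with first-touch allocation, memory
 * touched afterwards ends up on that CPU's local NUMA node.
 * Returns the chosen CPU, or -1 on failure. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;    /* thread is now bound to this CPU */
        }
    }
    return -1;
}
```

Pinning before the arrays are first written matters, since first-touch placement decides which node the pages land on.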

Remote DRAM takes almost twice as long to read as local DRAM on the Intel machines that I have used.
