Effects of loop unrolling on memory bound data

Problem description

I have been working with a piece of code that is intensively memory bound. I am trying to optimize it within a single core by manually implementing cache blocking, software prefetching, loop unrolling, etc. Cache blocking gives a significant improvement in performance; however, when I introduce loop unrolling I get tremendous performance degradation.

I am compiling with Intel icc, with the compiler flags -O2 and -ipo, in all my test cases.

My code is similar to this (a 3D 25-point stencil):

    void stencil_baseline(double *V, double *U, int dx, int dy, int dz,
                          double c0, double c1, double c2, double c3, double c4)
    {
        int i, j, k;

        for (k = 4; k < dz - 4; k++) {
            for (j = 4; j < dy - 4; j++) {
                // x-direction
                for (i = 4; i < dx - 4; i++) {
                    U[k*dy*dx + j*dx + i] = c0 *  V[k*dy*dx + j*dx + i]  // center
                        + c1 * (V[k*dy*dx + j*dx + (i-1)] + V[k*dy*dx + j*dx + (i+1)])
                        + c2 * (V[k*dy*dx + j*dx + (i-2)] + V[k*dy*dx + j*dx + (i+2)])
                        + c3 * (V[k*dy*dx + j*dx + (i-3)] + V[k*dy*dx + j*dx + (i+3)])
                        + c4 * (V[k*dy*dx + j*dx + (i-4)] + V[k*dy*dx + j*dx + (i+4)]);
                }

                // y-direction
                for (i = 4; i < dx - 4; i++) {
                    U[k*dy*dx + j*dx + i] += c1 * (V[k*dy*dx + (j-1)*dx + i] + V[k*dy*dx + (j+1)*dx + i])
                        + c2 * (V[k*dy*dx + (j-2)*dx + i] + V[k*dy*dx + (j+2)*dx + i])
                        + c3 * (V[k*dy*dx + (j-3)*dx + i] + V[k*dy*dx + (j+3)*dx + i])
                        + c4 * (V[k*dy*dx + (j-4)*dx + i] + V[k*dy*dx + (j+4)*dx + i]);
                }

                // z-direction
                for (i = 4; i < dx - 4; i++) {
                    U[k*dy*dx + j*dx + i] += c1 * (V[(k-1)*dy*dx + j*dx + i] + V[(k+1)*dy*dx + j*dx + i])
                        + c2 * (V[(k-2)*dy*dx + j*dx + i] + V[(k+2)*dy*dx + j*dx + i])
                        + c3 * (V[(k-3)*dy*dx + j*dx + i] + V[(k+3)*dy*dx + j*dx + i])
                        + c4 * (V[(k-4)*dy*dx + j*dx + i] + V[(k+4)*dy*dx + j*dx + i]);
                }
            }
        }
    }
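For reference, the cache blocking mentioned above can be sketched on the x-direction pass alone by tiling the j loop. This is a hypothetical reconstruction, not my exact blocked code; BJ and the x_pass_* names are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

#define BJ 8   /* illustrative tile size; tune to the target cache */

/* Untiled x-direction pass (same arithmetic as the stencil above). */
static void x_pass(double *U, const double *V, int dx, int dy, int dz,
                   double c0, double c1, double c2, double c3, double c4)
{
    for (int k = 4; k < dz - 4; k++)
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p = k * dy * dx + j * dx + i;
                U[p] = c0 *  V[p]                 /* center */
                     + c1 * (V[p - 1] + V[p + 1])
                     + c2 * (V[p - 2] + V[p + 2])
                     + c3 * (V[p - 3] + V[p + 3])
                     + c4 * (V[p - 4] + V[p + 4]);
            }
}

/* Sketch of cache blocking: the j loop is tiled so each block of rows
 * can stay resident in cache while k sweeps over it. */
static void x_pass_blocked(double *U, const double *V, int dx, int dy,
                           int dz, double c0, double c1, double c2,
                           double c3, double c4)
{
    for (int jj = 4; jj < dy - 4; jj += BJ) {
        int jend = (jj + BJ < dy - 4) ? jj + BJ : dy - 4;  /* tile end */
        for (int k = 4; k < dz - 4; k++)
            for (int j = jj; j < jend; j++)
                for (int i = 4; i < dx - 4; i++) {
                    int p = k * dy * dx + j * dx + i;
                    U[p] = c0 *  V[p]
                         + c1 * (V[p - 1] + V[p + 1])
                         + c2 * (V[p - 2] + V[p + 2])
                         + c3 * (V[p - 3] + V[p + 3])
                         + c4 * (V[p - 4] + V[p + 4]);
                }
    }
}
```

Since each output element is written exactly once in either order, the tiled pass computes bit-identical results to the untiled one.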

When I do loop unrolling on the innermost loop (dimension i) and unroll the x, y, and z passes separately by unroll factors 2, 4, and 8, I get performance degradation in all 9 cases: unroll by 2 on direction x, unroll by 2 on direction y, unroll by 2 on direction z, unroll by 4 on direction x, etc. But when I perform loop unrolling on the outermost loop (dimension k) by a factor of 8 (and also 2 and 4), I get a very good performance improvement, even better than cache blocking.
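The k-unrolling above, applied to the z-direction pass alone, looks roughly like this (a sketch with illustrative z_pass_* names, not my exact code). Note that the two updates per trip read overlapping V planes, k-3 through k+4, which may be part of why outer unrolling helps:

```c
#include <assert.h>
#include <stdlib.h>

/* Baseline z-direction pass (same arithmetic as the stencil above). */
static void z_pass(double *U, const double *V, int dx, int dy, int dz,
                   double c1, double c2, double c3, double c4)
{
    int s = dy * dx;                          /* stride of one k plane */
    for (int k = 4; k < dz - 4; k++)
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p = k * s + j * dx + i;
                U[p] += c1 * (V[p - s]   + V[p + s])
                      + c2 * (V[p - 2*s] + V[p + 2*s])
                      + c3 * (V[p - 3*s] + V[p + 3*s])
                      + c4 * (V[p - 4*s] + V[p + 4*s]);
            }
}

/* Sketch: the same pass with the outer k loop unrolled by 2.  The two
 * updates per trip share the V planes k-3 .. k+4. */
static void z_pass_unroll_k2(double *U, const double *V, int dx, int dy,
                             int dz, double c1, double c2, double c3,
                             double c4)
{
    int s = dy * dx;
    int k = 4;
    for (; k + 1 < dz - 4; k += 2)            /* two k planes per trip */
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p0 = k * s + j * dx + i;  /* plane k   */
                int p1 = p0 + s;              /* plane k+1 */
                U[p0] += c1 * (V[p0 - s]   + V[p0 + s])
                       + c2 * (V[p0 - 2*s] + V[p0 + 2*s])
                       + c3 * (V[p0 - 3*s] + V[p0 + 3*s])
                       + c4 * (V[p0 - 4*s] + V[p0 + 4*s]);
                U[p1] += c1 * (V[p1 - s]   + V[p1 + s])
                       + c2 * (V[p1 - 2*s] + V[p1 + 2*s])
                       + c3 * (V[p1 - 3*s] + V[p1 + 3*s])
                       + c4 * (V[p1 - 4*s] + V[p1 + 4*s]);
            }
    for (; k < dz - 4; k++)                   /* remainder plane */
        for (int j = 4; j < dy - 4; j++)
            for (int i = 4; i < dx - 4; i++) {
                int p = k * s + j * dx + i;
                U[p] += c1 * (V[p - s]   + V[p + s])
                      + c2 * (V[p - 2*s] + V[p + 2*s])
                      + c3 * (V[p - 3*s] + V[p + 3*s])
                      + c4 * (V[p - 4*s] + V[p + 4*s]);
            }
}
```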

I even tried profiling my code with Intel VTune. The bottlenecks were mainly due to 1. LLC Misses and 2. LLC Load Misses serviced by Remote DRAM.

I am unable to understand why unrolling the innermost, fastest loop gives a performance degradation, whereas unrolling the outermost, slowest dimension yields a performance improvement. Note that the improvement in the latter case occurs when I compile with icc using -O2 and -ipo.

I am not sure how to interpret these statistics. Can someone help shed some light on this?

Recommended answer

This strongly suggests that the unrolling is causing instruction cache misses, which is typical. In the age of modern hardware, unrolling no longer automatically means faster code. If each inner loop fits in a cache line, you will get better performance.

You may be able to unroll manually, to limit the size of the generated code, but this will require examining the generated machine-language instructions -- and their position -- to ensure that your loop is within a single cache line. Cache lines are typically 64 bytes long, and aligned on 64-byte boundaries.

Outer loops do not have the same effect. They will likely be outside of the instruction cache regardless of the unroll level. Unrolling these results in fewer branches, which is why you get better performance.

"Load misses serviced by remote DRAM" means that you allocated memory on one NUMA node, but now you are running on the other. Setting process or thread affinity based on NUMA is the answer.
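On Linux, one way to do this is a sketch like the following (pin_to_first_allowed_cpu is an illustrative helper, not part of the answer); the same effect is available without code changes by running under numactl --cpunodebind=0 --membind=0:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Sketch: bind the calling thread to the first CPU it is currently
 * allowed to run on.  Combined with first-touch allocation, memory
 * touched afterwards ends up on that CPU's local NUMA node.
 * Returns the chosen CPU, or -1 on failure. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof one, &one) != 0)
                return -1;
            return cpu;    /* thread is now bound to this CPU */
        }
    }
    return -1;
}
```

Pinning before the arrays are first written matters, since first-touch placement decides which node the pages land on.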

Remote DRAM takes almost twice as long to read as local DRAM on the Intel machines that I have used.
