code连续存取L1缓存慢于code访问3X非连续L1存储器 [英] Code accessing continuous L1 cache memory slower than code accessing 3x non-continuous L1 memory

查看:522
本文介绍了code连续存取L1缓存慢于code访问3X非连续L1存储器的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问题的更新,这一点:<一href=\"http://stackoverflow.com/questions/23162398/accessing-three-static-arrays-is-quicker-than-one-static-array-containing-3x-dat\">Accessing三个静态数组是包含3倍的数据速度比一个静态数组?,但我有太多的更新,因此这是有意义的开始一个新的。

This question is an update to this: Accessing three static arrays is quicker than one static array containing 3x data? but I have too many updates so it made sense to start a new one.

我有700项目,每个项目中有三个属性 - A B ë。我通过每一个项目环和使用三个atrributes来计算两个全球性的属性称为 C D 值。

I have 700 "items" and each item has three attributes- a, b and e. I loop through every item and use the three atrributes to calculate values for two "global" attributes called c and d.

我用了两个技术来实现这一点:

I used two techniques to implement this:

1)三700元件阵列,一个阵列的每个三个属性的:

1) Three 700-element arrays, one array for each of the three attributes:

 item0.a = array1[0]
 item0.b = array2[0]
 item0.e = array3[0]

2)含有数据连续三个属性一2100元素的数组:

2) One 2100-element array containing data for the three attributes consecutively:

 item0.a = array[offset + 0]
 item0.b = array[offset + 1]
 item0.e = array[offset + 2]

现在三大项属性 A B 电子是在环 - 中一起使用,因此会更有意义,如果你把它们存储在一个阵列中的表现应该比如果使用三阵技术更好。

Now the three item attributes a, b and e are used together within the loop- therefore it would make sense that if you store them in one array the performance should be better than if you use the three-array technique.

这是因为在技术#1中的每个的三个属性(每相同项)都位于相隔(700×4字节)。因此,我将不得不检索项目三个不同的高速缓存行。

This is because in technique #1 each of the three attributes (per same item) are located (700 x 4 bytes) apart. I would therefore have to retrieve three different cache lines for an item.

然而,随着技术#2的三个属性是位于内存中相邻,所以一个高速缓存行会给我所有三个项目的属性。事实上,因为高速缓存线是64个字节,我使用integers-会有每个缓存线路负载提供16整数。这意味着我可以处理每缓存行16/3 = 5个项目。

However, with technique #2 the three attributes are located adjacent in memory, so one cache line will give me all three item attributes. In fact because the cache line is 64 bytes and I am using integers- there will be 16 integers available per cache line load. This means I can process 16/3 = 5 items per cache line.

在换句话说,技术#1应该至少需要三级缓存线路负荷处理5个项目,而技术#2将只需要一个高速缓存行负载。我期待技术#2是快了不少。

In other words, technique #1 should require at least 3 cache line loads to process 5 items whereas technique #2 would only require one cache line load. I was expecting technique #2 to be a lot faster.

结果(Win 7的64,MSVC 2012年台式机的Ivy Bridge CPU):

Results (Win 7 64, MSVC 2012 with desktop Ivy Bridge CPU):


  • 技巧#1:3000的CP​​U周期(平均)

  • 技巧#2:3400个CPU周期(平均)

技术#1 C ++:

unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];

//I have left out code for simplicity. You can assume by now the arrays are populated.

start = __rdtscp(&x);

for(int i=0; i < 700; i++){
    unsigned int a= array1[i];     //Array 1
    unsigned int b= array2[i];     //Array 2

    data_for_all_items = data_for_all_items & (a!= -1 & b != -1);

    unsigned int e = array3[i];    //Array 3

    c += (a * e);
    d += (b * e);
}

finish = __rdtscp(&y);

if(data_for_all_items){
    std::cout << finish - start << std::endl;
    //This line prevents compiler over-optimizing
    std::cout << c << d << std::endl;
}

大会技巧#1:

            unsigned int x;
            unsigned int y;

            start = __rdtscp(&x);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[this]  

            for(int i=0; i < 700; i++){
 sub         rdi,rsi  
 or          rax,rdx  
 lea         r10,[rsi+4]  
 mov         ebx,46h  
 mov         dword ptr [r8],ecx  
 mov         r11,rax  
 sub         r9,rsi  
                unsigned int a = array1[i];
                unsigned int b = array2[i];

                all_instr_found = all_instr_found & (a != -1 & b != -1);
 cmp         dword ptr [r10-4],0FFFFFFFFh  
 xorps       xmm0,xmm0  
 setne       cl  
 cmp         dword ptr [r10+rdi-4],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 and         bpl,cl  

                unsigned int e = array3[i];
 mov         ecx,dword ptr [r10+r9-4]  


                c += (a * e);
 mov         eax,ecx  
                d += (b * e);
 imul        ecx,dword ptr [r10-4]  
 imul        eax,dword ptr [r10+rdi-4]  
 cmp         dword ptr [r10],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [rdi+r10],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r9+r10]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        ecx,dword ptr [r10]  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [rdi+r10]  
 cmp         dword ptr [r10+4],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [rdi+r10+4],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r9+r10+4]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        ecx,dword ptr [r10+4]  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [rdi+r10+4]  
 cmp         dword ptr [r10+8],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [rdi+r10+8],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+8]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        ecx,dword ptr [r10+8]  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [rdi+r10+8]  
 cmp         dword ptr [r10+0Ch],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [r10+rdi+0Ch],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+0Ch]  
 mov         eax,ecx  
                d += (b * e);
 addsd       xmm7,xmm0  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [r10+rdi+0Ch]  
 imul        ecx,dword ptr [r10+0Ch]  
 cvtsi2sd    xmm0,rax  
 addsd       xmm6,xmm0  
 cmp         dword ptr [r10+10h],0FFFFFFFFh  
 mov         eax,ecx  
 xorps       xmm0,xmm0  
 setne       cl  
 cmp         dword ptr [r10+rdi+10h],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+10h]  
 addsd       xmm7,xmm0  
 mov         eax,ecx  
 xorps       xmm0,xmm0  
 imul        ecx,dword ptr [r10+10h]  
 imul        eax,dword ptr [r10+rdi+10h]  
 cmp         dword ptr [r10+14h],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [r10+rdi+14h],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+14h]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        ecx,dword ptr [r10+14h]  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [r10+rdi+14h]  
 cmp         dword ptr [r10+18h],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [r10+rdi+18h],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+18h]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        ecx,dword ptr [r10+18h]  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [r10+rdi+18h]  
 cmp         dword ptr [r10+1Ch],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [r10+rdi+1Ch],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+1Ch]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        ecx,dword ptr [r10+1Ch]  
 xorps       xmm0,xmm0  
 imul        eax,dword ptr [r10+rdi+1Ch]  
 cmp         dword ptr [r10+20h],0FFFFFFFFh  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 setne       cl  
 cmp         dword ptr [r10+rdi+20h],0FFFFFFFFh  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 and         bpl,cl  
 mov         ecx,dword ptr [r10+r9+20h]  
 mov         eax,ecx  
 addsd       xmm7,xmm0  
 imul        eax,dword ptr [r10+rdi+20h]  
 imul        ecx,dword ptr [r10+20h]  
 xorps       xmm0,xmm0  
 add         r10,28h  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 addsd       xmm7,xmm0  
 dec         rbx  
 jne         013FC6DA71h  

            }

            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[y]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx

(它看起来像编译器展开循环为技术#1五次)

(It looks like the compiler was unrolling the loop five times for Technique #1)

技术#2 C ++:

unsigned int x;
unsigned int y;
unsigned short pos = 0;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];

//I have left out code for simplicity. You can assume by now the array is populated.

start = __rdtscp(&x);

while(pos < 2100){
    unsigned int a = array[pos + 0];     
    unsigned int b = array[pos + 1];     

    data_for_all_items = data_for_all_items & (a!= -1 & b != -1);

    unsigned int e = array[pos + 2];    

    c += (a * e);
    d += (b * e);
    pos = pos + 3;
}

finish = __rdtscp(&y);

if(data_for_all_items){
    std::cout << finish - start << std::endl;
    //This line prevents compiler over-optimizing
    std::cout << c << d << std::endl;
}

大会技巧#2:

            start = __rdtscp(&x);
 rdtscp  
 shl         rdx,20h  
 lea         r9,[this]  
 mov         r11d,2BCh  
 or          rax,rdx  
 mov         dword ptr [r9],ecx  
 add         r10,8  
 mov         rbx,rax  
 nop         word ptr [rax+rax]  

            while(pos < 2100){

                unsigned int a = array[pos];
                unsigned int b = array[pos + 1];

                all_instr_found = all_instr_found & (a != -1 & b != -1);
 cmp         dword ptr [r10-4],0FFFFFFFFh  
 xorps       xmm0,xmm0  
 setne       dl  
 cmp         dword ptr [r10-8],0FFFFFFFFh  
 setne       cl  

                unsigned int e = array[pos + 2];

                c += (a * e);
                d += (b * e);

                pos = pos + 3;
 add         r10,0Ch  
 and         dl,cl  
 mov         ecx,dword ptr [r10-0Ch]  
 mov         eax,ecx  
 and         r15b,dl  
 imul        ecx,dword ptr [r10-10h]  
 imul        eax,dword ptr [r10-14h]  
 cvtsi2sd    xmm0,rax  
 mov         eax,ecx  
 addsd       xmm6,xmm0  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 addsd       xmm7,xmm0  
 dec         r11  
 jne         013F21DA30h 
            }



            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[y]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx

因此​​,没有循环展开?这必须是这个原因吧?嗯,我手动展开使用这个C ++循环:

So no loop unrolling? This must be the reason right? Well I manually unrolled the loop using this C++:

    while(pos < 2100){
        unsigned int a1 = array[pos + 0];
        unsigned int b1 = array[pos + 1];
        unsigned int e1 = array[pos + 2];

        unsigned int a2 = array[pos + 3];
        unsigned int b2 = array[pos + 4];
        unsigned int e2 = array[pos + 5];

        unsigned int a3 = array[pos + 6];
        unsigned int b3 = array[pos + 7];
        unsigned int e3 = array[pos + 8];

        unsigned int a4 = array[pos + 9];
        unsigned int b4 = array[pos + 10];
        unsigned int e4 = array[pos + 11];

        unsigned int a5 = array[pos + 12];
        unsigned int b5 = array[pos + 13];
        unsigned int e5 = array[pos + 14];

        c += (a1 * e1) + (a2 * e2) + (a3 * e3) + (a4 * e4) + (a5 * e5);
        d += (b1 * e1) + (b2 * e2) + (b3 * e3) + (b4 * e4) + (b5 * e5);

        data_for_all_items = data_for_all_items & (a1 != -1 & b1 != -1) & (a2 != -1 & b2 != -1) & (a3 != -1 & b3 != -1) & (a4 != -1 & b4 != -1) & (a5 != -1 & b5 != -1);

        pos += 15;
    }

    if(data_for_all_items){
        std::cout << finish - start << std::endl;
        //This line prevents compiler over-optimizing
        std::cout << c << d << std::endl;
    }

生成此组件:

           start = __rdtscp(&x);
 rdtscp  
 lea         r9,[x]  
 shl         rdx,20h  
 mov         qword ptr [rsp+108h],8Ch  
 mov         dword ptr [r9],ecx  
 lea         r9,[r8+8]  
 or          rax,rdx  
 mov         qword ptr [y],r9  
 mov         qword ptr [rsp+20h],rax  
 nop  
                unsigned int a1 = array[pos + 0];
                unsigned int b1 = array[pos + 1];
                unsigned int e1 = array[pos + 2];
 mov         edi,dword ptr [r9]  

                unsigned int a2 = array[pos + 3];
 mov         r13d,dword ptr [r9+4]  
                unsigned int b2 = array[pos + 4];
 mov         r12d,dword ptr [r9+8]  
 xorps       xmm0,xmm0  
                unsigned int e2 = array[pos + 5];
 mov         r10d,dword ptr [r9+0Ch]  

                unsigned int a3 = array[pos + 6];
 mov         r15d,dword ptr [r9+10h]  
                unsigned int b3 = array[pos + 7];
 mov         r14d,dword ptr [r9+14h]  
                unsigned int e3 = array[pos + 8];
 mov         r8d,dword ptr [r9+18h]  

                unsigned int a4 = array[pos + 9];
 mov         ebp,dword ptr [r9+1Ch]  
                unsigned int b4 = array[pos + 10];
 mov         esi,dword ptr [r9+20h]  
                unsigned int e4 = array[pos + 11];
 mov         edx,dword ptr [r9+24h]  

                unsigned int a5 = array[pos + 12];
 mov         ebx,dword ptr [r9+28h]  
                unsigned int b5 = array[pos + 13];
 mov         r11d,dword ptr [r9+2Ch]  
                unsigned int e5 = array[pos + 14];
 mov         r9d,dword ptr [r9+30h]  

                c += (a1 * e1) + (a2 * e2) + (a3 * e3) + (a4 * e4) + (a5 * e5);
 mov         eax,edx  
 mov         dword ptr [x],r13d  
                d += (b1 * e1) + (b2 * e2) + (b3 * e3) + (b4 * e4) + (b5 * e5);
 imul        edx,esi  
 imul        eax,ebp  
 mov         ecx,r9d  
 imul        r9d,r11d  
 imul        ecx,ebx  
 add         r9d,edx  
 add         ecx,eax  
 mov         eax,r8d  
 imul        r8d,r14d  
 imul        eax,r15d  
 add         r9d,r8d  

                all_instr_found = all_instr_found & (a1 != -1 & b1 != -1) & (a2 != -1 & b2 != -1) & (a3 != -1 & b3 != -1) & (a4 != -1 & b4 != -1) & (a5 != -1 & b5 != -1);
 movzx       r8d,byte ptr [this]  
 add         ecx,eax  
 mov         eax,r10d  
 imul        r10d,r12d  
 add         r9d,r10d  
 imul        eax,r13d  
 mov         r13,qword ptr [y]  
 add         ecx,eax  
 mov         eax,edi  
 imul        eax,dword ptr [r13-8]  
 add         eax,ecx  
 cvtsi2sd    xmm0,rax  
 mov         rax,r13  
 mov         edx,dword ptr [rax-4]  
 imul        edi,edx  
 cmp         r11d,0FFFFFFFFh  
 addsd       xmm6,xmm0  
 setne       cl  

                all_instr_found = all_instr_found & (a1 != -1 & b1 != -1) & (a2 != -1 & b2 != -1) & (a3 != -1 & b3 != -1) & (a4 != -1 & b4 != -1) & (a5 != -1 & b5 != -1);
 cmp         ebx,0FFFFFFFFh  
 lea         eax,[rdi+r9]  
 mov         r9,r13  
 xorps       xmm0,xmm0  
 cvtsi2sd    xmm0,rax  
 setne       al  
 and         cl,al  
 cmp         esi,0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         ebp,0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         r14d,0FFFFFFFFh  
 addsd       xmm7,xmm0  
 setne       al  
 and         cl,al  
 cmp         r15d,0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         r12d,0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [x],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         edx,0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [r13-8],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 and         r8b,cl  

                pos += 15;
 add         r9,3Ch  
 mov         qword ptr [y],r9  
 mov         byte ptr [this],r8b  


            while(pos < 2100){
 dec         qword ptr [rsp+108h]  
 jne         013F31DA50h  
            }


            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r9,[y]  
 mov         dword ptr [r9],ecx  
 mov         r15,qword ptr [rsp+0A8h]  
 mov         r13,qword ptr [rsp+0B0h]  
 mov         r12,qword ptr [rsp+0B8h]  
 or          rax,rdx  

新的人工循环展开平均需要3500个CPU周期,比未推出原始版本这是3400较慢的CPU周期。

The new manual loop-unrolling takes 3500 CPU cycles on average, slower than the un-rolled original version which was 3400 CPU cycles.

能否有人请解释为什么code应该从缓存中提取的3倍少慢,是否有办法帮助编译器或得到这个code比技术#1?

Can somebody please explain why code which should fetch from the cache 3x less is slower and whether there is a way to help the compiler or get this code faster than technique #1?

推荐答案

使用一些prefetching在#2,你可能只是实现这一远大目标的性能。结果
对于GCC,使用内联汇编器和3DNOW prefetch指令(几乎所有从现在延伸离开)。

Use some prefetching in #2, and you might just reach that lofty performance goal.
For gcc, use inline assembler and the 3dNow Prefetch instructions (just about all left from that extension nowadays).

您例如排名第一的可能运行得更快,因为这样做3个独立的内存请求(因乱顺序执行)是指内存总线上更好的吞吐量(减少空闲时间)。

Your example number one probably runs faster because doing 3 independent memory requests (due to out-of-order-execution) means better throughput on the memory bus (less idle time).

这是微优化,不仅使得事情更难你,但作为弄巧成拙一个很好的例子。结果
处理器设计者和制造商,以及编译器编写者花真是太神奇了大量的金钱,时间和人才,为常见的情况,以便优化的猜测它们是困难的。

That's a nice example for micro-optimisation not only making things harder on you, but being self-defeating.
The processor designer and maker as well as the compiler writers spend really amazing amounts of money, time and talent optimizing for the common case so second-guessing them is hard.

这篇关于code连续存取L1缓存慢于code访问3X非连续L1存储器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆