How to use movntdqa to avoid cache pollution?

Problem description

I am trying to write a memcpy function that does not load the source memory into the CPU cache. The purpose is to avoid cache pollution. The memcpy function below works, but it pollutes the cache just like the standard memcpy does. I am using a P8700 processor with Visual C++ 2008 Express, and I observe the CPU cache usage with Intel VTune.

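#include <smmintrin.h>  // SSE4.1: _mm_stream_load_si128 (MOVNTDQA)

// dst and src must be 16-byte aligned; size must be a multiple of 16.
void memcpy(char *dst, char *src, unsigned size) {
    char *dst_end = dst + size;
    while (dst != dst_end) {
        __m128i res = _mm_stream_load_si128((__m128i *)src);  // streaming load
        *((__m128i *)dst) = res;                              // ordinary aligned store
        src += 16;
        dst += 16;
    }
}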

I have another version that gives the same result: it works but pollutes the cache.

void memcpy(char *dst, char *src, unsigned size) {
    char *dst_end = dst + size;

    __asm {
        mov edi, dst
        mov edx, dst_end
        mov esi, src
    inner_start:
        lfence
        movntdqa xmm0, [esi]
        movntdqa xmm1, [esi+16]
        movntdqa xmm2, [esi+32]
        movntdqa xmm3, [esi+48]
        ; Copy data to buffer
        movdqa [edi], xmm0
        movdqa [edi+16], xmm1
        movdqa [edi+32], xmm2
        movdqa [edi+48], xmm3
        ; Increment pointers by cache line size and test for end of loop
        add esi, 040h
        add edi, 040h
        cmp edi, edx
        jne inner_start
    }
}

Update: this is the test program.

void test(int table_size, int num_iter, int item_size) {
    char *src_table = alloc_aligned(table_size * item_size);  // return value is aligned on 64 bytes
    char *dst = alloc_aligned(item_size);                     // destination is always the same buffer
    for (int i = 0; i < num_iter; i++) {
        int location = my_rand() % table_size;
        char *src = src_table + location * item_size;         // selecting a different src every time
        memcpy(dst, src, item_size);
    }
}

int main() {
    test(1024 * 32, 1024 * 1024, 1024 * 32);
    return 0;
}
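
The test program calls alloc_aligned and my_rand without showing them. A minimal sketch of what they might look like follows; both bodies are assumptions, not the asker's code. It assumes a compiler that provides _mm_malloc through <xmmintrin.h> and uses the standard rand() as a placeholder generator.

```c
#include <stddef.h>
#include <stdlib.h>
#include <xmmintrin.h>  /* _mm_malloc / _mm_free */

/* Hypothetical stand-in: returns a buffer aligned on 64 bytes
   (one cache line), or NULL on failure. Release it with _mm_free. */
char *alloc_aligned(size_t size)
{
    return (char *)_mm_malloc(size, 64);
}

/* Hypothetical stand-in: a non-negative pseudo-random integer. */
int my_rand(void)
{
    return rand();
}
```

Buffers obtained this way are ordinary heap memory, and they must be freed with _mm_free rather than free.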

Solution

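Quoting from Intel (http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load):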

"The streaming load instruction is intended to accelerate data transfers from the USWC memory type. For other memory types such as cacheable (WB) or Uncacheable (UC), the instruction behaves as a typical 16-byte MOVDQA load instruction. However, future processors may use the streaming load instruction for other memory types (such as WB) as a hint that the intended cache line should be streamed from memory directly to the core while minimizing cache pollution."

That explains why the code does not work: the buffers are ordinary write-back (WB) memory, so MOVNTDQA behaves like a regular MOVDQA load.
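
Since the streaming load gives no benefit on WB memory here, a different technique is sometimes tried instead: a non-temporal prefetch hint (PREFETCHNTA) on the source combined with non-temporal stores (MOVNTDQ) to the destination. The sketch below is not from the answer. The function name copy_nt_sketch and the 256-byte prefetch distance are assumptions, and how much cache pollution it actually avoids depends on the processor, so it would need to be measured (for example with VTune).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch, _mm_sfence */

/* Hypothetical helper, not from the original post.
   Assumes dst and src are 16-byte aligned and size is a multiple of 64. */
void copy_nt_sketch(char *dst, const char *src, size_t size)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(size % 64 == 0);

    for (size_t i = 0; i < size; i += 64) {
        /* Non-temporal prefetch hint for a line ahead of the current loads.
           The distance of 256 bytes is an arbitrary assumption. */
        if (i + 256 < size)
            _mm_prefetch(src + i + 256, _MM_HINT_NTA);

        /* Ordinary aligned loads of one 64-byte cache line. */
        __m128i a = _mm_load_si128((const __m128i *)(src + i));
        __m128i b = _mm_load_si128((const __m128i *)(src + i + 16));
        __m128i c = _mm_load_si128((const __m128i *)(src + i + 32));
        __m128i d = _mm_load_si128((const __m128i *)(src + i + 48));

        /* Non-temporal stores: written through write-combining buffers. */
        _mm_stream_si128((__m128i *)(dst + i), a);
        _mm_stream_si128((__m128i *)(dst + i + 16), b);
        _mm_stream_si128((__m128i *)(dst + i + 32), c);
        _mm_stream_si128((__m128i *)(dst + i + 48), d);
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

In the test program above the destination is a single buffer that is reused on every iteration, so streaming stores may not be the right choice there. Keeping the ordinary stores and adding only the prefetch hint is another variant worth measuring.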
