Very fast memcpy for image processing?


Question

I am doing image processing in C that requires copying large chunks of data around in memory. The source and destination never overlap.

What is the absolute fastest way to do this on the x86 platform using GCC (where SSE and SSE2, but NOT SSE3, are available)?

I expect the solution will be either in assembly or using GCC intrinsics.

I found the following link but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html

Edit: note that a copy is necessary; I cannot get around having to copy the data (I could explain why, but I'll spare you the explanation :))

Recommended answer

Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.

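// MSVC 32-bit inline assembly (not GCC syntax).
// Requirements: src and dest are 16-byte aligned and do not overlap.
// Copies whole 128-byte blocks only: any trailing size % 128 bytes are NOT copied.
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{

  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer

    mov ebx, size;   //ebx is our counter
    shr ebx, 7;      //divide by 128 (8 * 128bit registers)
    jz loop_copy_end; //fewer than 128 bytes: skip the loop so the counter cannot wrap around

    loop_copy:
      prefetchnta 128[ESI]; //SSE prefetch of the next block (non-temporal hint)
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];

      movdqa xmm0, 0[ESI]; //move data from src to registers (aligned loads)
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];

      movntdq 0[EDI], xmm0; //move data from registers to dest (non-temporal stores, bypass the cache)
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;

      add esi, 128;
      add edi, 128;
      dec ebx;

      jnz loop_copy; //repeat until every 128-byte block is copied
    sfence;          //order the non-temporal stores before any later stores
    loop_copy_end:
  }
}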

You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.

You may also want to check out the memcpy source (memcpy.asm) and strip out its special-case handling. It may be possible to optimise further!
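Since the question asks about GCC and the routine above is written in MSVC inline-assembly syntax, here is a rough sketch of the same idea (aligned SSE2 loads followed by non-temporal stores) using the intrinsics from `<emmintrin.h>`. It is not the original author's code. The function name `copy_aligned_sse2` is made up for illustration, the 64-byte block size is an arbitrary choice, and the sketch assumes both pointers are 16-byte aligned and that the compiler defines `__SSE2__` (GCC does so with `-msse2` on 32-bit x86, and by default on x86-64). The prefetch hints are left out for simplicity; whether they help depends on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128, _mm_stream_si128 */
#endif

/* Hypothetical helper, not from the original answer.
 * Copies `size` bytes from `src` to `dest`.
 * Assumptions: both pointers are 16-byte aligned and the buffers do not overlap. */
void copy_aligned_sse2(void *dest, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    /* movdqa-style loads and non-temporal stores fault on misaligned addresses. */
    assert(((uintptr_t)d & 15u) == 0 && ((uintptr_t)s & 15u) == 0);

#if defined(__SSE2__)
    size_t blocks = size / 64;  /* number of whole 64-byte blocks */
    for (size_t i = 0; i < blocks; i++) {
        /* Aligned 16-byte loads from the source. */
        __m128i a = _mm_load_si128((const __m128i *)(s + 0));
        __m128i b = _mm_load_si128((const __m128i *)(s + 16));
        __m128i c = _mm_load_si128((const __m128i *)(s + 32));
        __m128i e = _mm_load_si128((const __m128i *)(s + 48));

        /* Non-temporal stores: write to memory without filling the cache. */
        _mm_stream_si128((__m128i *)(d + 0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        s += 64;
        d += 64;
    }
    _mm_sfence();        /* order the streaming stores before later stores */
    size -= blocks * 64; /* bytes left over after the block loop */
#endif

    /* Tail bytes (or the whole buffer when SSE2 is unavailable). */
    memcpy(d, s, size);
}
```

Unlike the assembly routine, this version also copies the bytes left over after the last full block, and it falls back to plain `memcpy` when SSE2 is not enabled. Non-temporal stores tend to pay off only for buffers much larger than the cache, so it is worth benchmarking against the library `memcpy` on the target machine before adopting it.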
