具有SIMD的向量的点积 [英] Dot Product of Vectors with SIMD

查看:167
本文介绍了具有SIMD的向量的点积的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用SIMD指令来加快我的C代码中的点积计算.但是,我的函数的运行时间大致相等.如果有人能解释为什么以及如何加快计算速度,那就太好了.

I am attempting to use SIMD instructions to speed up a dot product calculation in my C code. However, the run times of my functions are approximately equal. It'd be great if someone could explain why and how to speed up the calculation.

具体来说,我正在尝试计算其中包含约10,000个元素的两个数组的点积.我的常规C函数如下:

Specifically, I'm attempting to calculate the dot product of two arrays with about 10,000 elements in them. My regular C function is as follows:

 float my_dotProd( float const * const x, float const * const y, size_t const N ){
   // N is the number of elements in the arrays
   size_t i;
   float out=0;

   for( i=0; i < N; ++i ){
     out += x[i] * y[i];
   }

   return out;
 }

使用AVX SIMD命令的功能如下:

My function where I use AVX SIMD commands is as follows:

 void my_malloc( size_t nBytes, void ** ptrPtr ){
   int boundary = 32;
   posix_memalign( ptrPtr, boundary, nBytes );
 }

 float cimpl_sum_m128( __m128 x ){
   float out;
   __m128 sum = x;
   sum = _mm_hadd_ps( sum, sum );
   sum = _mm_hadd_ps( sum, sum );
   out = _mm_cvtss_f32( sum );
   return out;
 }

 float my_sum_m256( __m256 x ){
   float out1, out2;
   __m128 hi = _mm256_extractf128_ps(x, 1);
   __m128 lo = _mm256_extractf128_ps(x, 0);
   out1 = cimpl_sum_m128( hi );
   out2 = cimpl_sum_m128( lo );
   return out1 + out2;
 }

 float my_dotProd( float const * const x, float const * const y, size_t const N ){
   // N is the number of elements in the arrays
   size_t i=0;
   float out=0;
   float *tmp;

   __m256 summed, *l, *r;

   if( N > 7 ){
     my_malloc( sizeof(float) * 8, (void**) &tmp );
     summed = _mm256_set1_ps(0.0f);
     l = (__m256*) x;
     r = (__m256*) y;

     for( i=0; i < N-7; i+=8, ++l, ++r ){
       summed = _mm256_add_ps( summed, _mm256_mul_ps( *l, *r ) );
     }
     _mm256_store_ps( tmp, summed );

     out += my_sum_m256( summed );
     free( tmp );
   }

   for( ; i < N; ++i ){
     out += x[i] * y[i];
   }

   return out;
 }

我的测试例程是:

 int test_dotProd(){
   float *x, *y;
   size_t i, N;
   float answer, result;
   float err;

   N = 100000;  // Fails

   my_malloc( sizeof(float) * N, (void**) &x );
   my_malloc( sizeof(float) * N, (void**) &y );

   answer = 0;
   for( i=0; i<N; ++i ){
     x[i]=i; y[i]=i;
     answer += (float)i * (float)i;
   }

   result = my_dotProd( x, y, N );

   err = fabs( result - answer ) / answer;

   free( x );
   free( y );
   return err < 5e-7;
 }

我正在使用时钟来测量运行时,如下所示:

And I'm using clock to measure the runtime like this:

 timeStart = clock();
 testStatus = test_dotProd();
 timeTaken = (int)( clock() - timeStart );

我意识到可以提高my_sum_m256操作的效率,但是我认为这对运行时的影响应该很小.我猜想SIMD代码的速度大约是八倍.有什么想法吗?

I realize that the my_sum_m256 operation could be made more efficient, but I think that should be a small effect on the runtime. I would have guessed that the SIMD code would have been about eight times as fast. Any thoughts?

谢谢大家的帮助:)

推荐答案

首先:您不应该假设您可以比编译器更好地进行优化.

First of all: You shouldn't assume that you can optimize better than the compiler could.

是的,您现在正在优化"代码中使用AVX指令.但是您还编写了代码,除了普通的矢量化之外,编译器现在还遇到了一些问题.

Yes, you are now using AVX instructions in your "optimized" code. But you also wrote code which the compiler now has issues unrolling, in addition to the plain vectorization.

为进行比较,让我们看一下编译器从慢速" C实现中实际产生的效果,只是没有页脚的热循环.

For comparison, let's have a look at what a compiler would actually make from your "slow" C implementation, just the hot loop without footer.

ICC,使用-O3 -march=skylake -ffast-math 编译:

..B1.13:
    vmovups   ymm2, YMMWORD PTR [rsi+rdi*4]
    vmovups   ymm3, YMMWORD PTR [32+rsi+rdi*4]
    vfmadd231ps ymm1, ymm2, YMMWORD PTR [r8+rdi*4]
    vfmadd231ps ymm0, ymm3, YMMWORD PTR [32+r8+rdi*4]
    add       rdi, 16
    cmp       rdi, rax
    jb        ..B1.13

使用相同参数的Clang 更为悲观,并将其展开至以下内容:

Clang, with the same parameters is even more pessimistic and unrolls this to the following:

.LBB0_4:
    vmovups ymm4, ymmword ptr [rsi + 4*rcx]
    vmovups ymm5, ymmword ptr [rsi + 4*rcx + 32]
    vmovups ymm6, ymmword ptr [rsi + 4*rcx + 64]
    vmovups ymm7, ymmword ptr [rsi + 4*rcx + 96]
    vfmadd132ps     ymm4, ymm0, ymmword ptr [rdi + 4*rcx]
    vfmadd132ps     ymm5, ymm1, ymmword ptr [rdi + 4*rcx + 32]
    vfmadd132ps     ymm6, ymm2, ymmword ptr [rdi + 4*rcx + 64]
    vfmadd132ps     ymm7, ymm3, ymmword ptr [rdi + 4*rcx + 96]
    vmovups ymm0, ymmword ptr [rsi + 4*rcx + 128]
    vmovups ymm1, ymmword ptr [rsi + 4*rcx + 160]
    vmovups ymm2, ymmword ptr [rsi + 4*rcx + 192]
    vmovups ymm3, ymmword ptr [rsi + 4*rcx + 224]
    vfmadd132ps     ymm0, ymm4, ymmword ptr [rdi + 4*rcx + 128]
    vfmadd132ps     ymm1, ymm5, ymmword ptr [rdi + 4*rcx + 160]
    vfmadd132ps     ymm2, ymm6, ymmword ptr [rdi + 4*rcx + 192]
    vfmadd132ps     ymm3, ymm7, ymmword ptr [rdi + 4*rcx + 224]
    add     rcx, 64
    add     rax, 2
    jne     .LBB0_4

惊奇的是,两个编译器都已经能够使用AVX指令,而无需进行内在的黑客攻击.

Surprise, both compilers were already able to use AVX instructions, no intrinsic hacking required.

但是更有趣的是,两个编译器都认为一个累加寄存器不足以使AVX管线饱和,而是使用了2个或4个累加寄存器.进行更多操作有助于掩盖FMA的延迟,直至达到实际的内存吞吐量限制.

But what's more interesting, is that both compiler decided that one accumulation register wasn't enough to saturate the AVX pipeline, but rather used 2 respectively 4 accumulation registers. Having more operations in flight helps masking latency of the FMA, up to the point where the actual memory throughput limit is reached.

请不要忘记-ffast-math编译器选项,否则将最终累加值从向量化循环中拉出来是不合法的.

Just don't forget the -ffast-math compiler option, without it wouldn't be legal to pull the final accumulation out of the vectorized loop.

GCC,也具有相同的选择,实际上是仅"的,与您的优化的"一样好"解决方案:

GCC, also with the same options, is actually "only" doing as good as your "optimized" solution:

.L7:
    add     r8, 1
    vmovaps ymm3, YMMWORD PTR [r9+rax]
    vfmadd231ps     ymm1, ymm3, YMMWORD PTR [rcx+rax]
    add     rax, 32
    cmp     r8, r10
    jb      .L7

但是,GCC在向循环添加头时还是比较聪明的,因此它可以在第一次加载时使用vmovaps(对齐的内存访问)而不是vmovups(未对齐的内存访问).

However GCC was still a little bit smarter in also adding a header to that loop, so it could use vmovaps (aligned memory access) instead of vmovups (unaligned memory access) for the first load.

为了完整起见,请使用纯AVX(-O3 -march=ivybridge -ffast-math):

For completeness, with pure AVX (-O3 -march=ivybridge -ffast-math):

ICC:

..B1.12:
    vmovups   xmm2, XMMWORD PTR [r8+rdi*4]
    vmovups   xmm5, XMMWORD PTR [32+r8+rdi*4]
    vinsertf128 ymm3, ymm2, XMMWORD PTR [16+r8+rdi*4], 1
    vinsertf128 ymm6, ymm5, XMMWORD PTR [48+r8+rdi*4], 1
    vmulps    ymm4, ymm3, YMMWORD PTR [rsi+rdi*4]
    vmulps    ymm7, ymm6, YMMWORD PTR [32+rsi+rdi*4]
    vaddps    ymm1, ymm1, ymm4
    vaddps    ymm0, ymm0, ymm7
    add       rdi, 16
    cmp       rdi, rax
    jb        ..B1.12

C语:

.LBB0_5:
    vmovups xmm4, xmmword ptr [rdi + 4*rcx]
    vmovups xmm5, xmmword ptr [rdi + 4*rcx + 32]
    vmovups xmm6, xmmword ptr [rdi + 4*rcx + 64]
    vmovups xmm7, xmmword ptr [rdi + 4*rcx + 96]
    vinsertf128     ymm4, ymm4, xmmword ptr [rdi + 4*rcx + 16], 1
    vinsertf128     ymm5, ymm5, xmmword ptr [rdi + 4*rcx + 48], 1
    vinsertf128     ymm6, ymm6, xmmword ptr [rdi + 4*rcx + 80], 1
    vinsertf128     ymm7, ymm7, xmmword ptr [rdi + 4*rcx + 112], 1
    vmovups xmm8, xmmword ptr [rsi + 4*rcx]
    vmovups xmm9, xmmword ptr [rsi + 4*rcx + 32]
    vmovups xmm10, xmmword ptr [rsi + 4*rcx + 64]
    vmovups xmm11, xmmword ptr [rsi + 4*rcx + 96]
    vinsertf128     ymm8, ymm8, xmmword ptr [rsi + 4*rcx + 16], 1
    vmulps  ymm4, ymm8, ymm4
    vaddps  ymm0, ymm4, ymm0
    vinsertf128     ymm4, ymm9, xmmword ptr [rsi + 4*rcx + 48], 1
    vmulps  ymm4, ymm4, ymm5
    vaddps  ymm1, ymm4, ymm1
    vinsertf128     ymm4, ymm10, xmmword ptr [rsi + 4*rcx + 80], 1
    vmulps  ymm4, ymm4, ymm6
    vaddps  ymm2, ymm4, ymm2
    vinsertf128     ymm4, ymm11, xmmword ptr [rsi + 4*rcx + 112], 1
    vmulps  ymm4, ymm4, ymm7
    vaddps  ymm3, ymm4, ymm3
    add     rcx, 32
    cmp     rax, rcx
    jne     .LBB0_5

海湾合作委员会:

.L5:
    vmovups xmm3, XMMWORD PTR [rdi+rax]
    vinsertf128     ymm1, ymm3, XMMWORD PTR [rdi+16+rax], 0x1
    vmovups xmm4, XMMWORD PTR [rsi+rax]
    vinsertf128     ymm2, ymm4, XMMWORD PTR [rsi+16+rax], 0x1
    add     rax, 32
    vmulps  ymm1, ymm1, ymm2
    vaddps  ymm0, ymm0, ymm1
    cmp     rax, rcx
    jne     .L5

应用了几乎相同的优化,由于缺少FMA,因此不建议使用其他一些操作,对于Ivy Bridge,不建议使用未对齐的256位负载.

Pretty much the same optimizations applied, just a few additional operations as FMA is missing and unaligned 256bit loads are not advisable for Ivy Bridge.

这篇关于具有SIMD的向量的点积的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆