Simd matmul program gives different numerical results


Problem description

I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get numerical errors starting from the 5th digit of the resulting matrix's coefficients.

REAL_T is just a typedef for float.

/* This is my matmul version with SIMD, using single-precision floats */
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  __m256 vA, vB, vC, vRes;
  for (i=0; i<n; i++){
    for (j=0; j<n; j++){  
      for (k=0; k<n; k= k+8){
        vA = _mm256_load_ps(&A[i*n+k]);
        vB = _mm256_loadu_ps(&B[k*n+j]);
        vC = _mm256_mul_ps(vA, vB);
        vC = _mm256_hadd_ps(vC, vC);
        vC = _mm256_hadd_ps(vC, vC);
        /*To get the resulting coefficient, after doing 2 hadds,
        I have to get the first and the last element of the resulting
        Vector vC*/
        C[i*n+j] += ((float )(vC[0])) + ((float )(vC[7]));
      } /* for k */
    } /* for j */
  } /* for i */
}
/* End of program */

/*And this is the sequential Version*/
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  for (i=0; i<n; i++){ 
    for (j=0; j<n; j++){
      for (k=0; k<n; k++){
        C[i*n+j] +=  A[i*n+k] *  B[k*n+j];  
      } /* for k */
    } /* for j */
  } /* for i */  
}
/*End of program*/

/* The matrices are initialized as follows */
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++){
      *(A+i*n+j) = 1 / ((REAL_T) (i+j+1));
      *(B+i*n+j) = 1.0;
      *(C+i*n+j) = 1.0;
    }
/*End of initialization*/

The tested matrices are of size 512*512. For the sequential version, the top-left corner of the resulting matrix is:

+6.916512e+01  +6.916512e+01  
+5.918460e+01  +5.918460e+01  

+7.946186e+00  +7.946186e+00  
+7.936391e+00  +7.936391e+00  

However, for the simd version, the square is:

+6.916510e+01  +6.916510e+01  
+5.918463e+01  +5.918463e+01  

+7.946147e+00  +7.946147e+00  
+7.936355e+00  +7.936355e+00 

As shown, there is a numerical discrepancy between the two versions. Any help would be really appreciated!

Solution

This looks normal; adding numbers in a different order produces different rounding in the temporaries.

FP math is not associative; optimizing as if it were will change the results (footnote 1). See Is floating point addition and multiplication associative? / Are floating point operations in C associative?

The amount of change depends on the data. Differences only in the 5th decimal place seems reasonable for float.


Unless you were taking special numerical precautions, like adding up the small numbers first, the sequential-order result isn't "more correct"; the two versions just have different errors.

In fact, using multiple accumulators generally increases precision for large lists, assuming your numbers all have similar magnitude. (Ideally use multiple SIMD vectors, each holding multiple elements, to hide FP-add or FMA latency.) https://en.wikipedia.org/wiki/Pairwise_summation is a numerical technique that takes this to the next level: summing subsets of the list in a tree, to avoid adding single array elements to a much larger value. See, for example, How to avoid less precise sum for numpy-arrays with multiple columns.
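A minimal recursive sketch of pairwise summation (the function name and the base-case size of 8 are illustrative choices, not from the question):

```c
/* Pairwise summation: recursively sum the two halves of the array,
   so no single element is ever added directly into a much larger
   running total.  Worst-case error grows like log N instead of N. */
float pairwise_sum(const float *a, int n) {
    if (n <= 8) {               /* small base case: plain sequential loop */
        float s = 0.0f;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }
    int half = n / 2;
    return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
}
```

In practice libraries unroll the base case rather than recursing all the way down, which is why a plain loop handles the last few elements here.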

Using a fixed number of accumulators (e.g. 8x __m256 = 64 float accumulators) might reduce expected error by a factor of 64, instead of from N to log N for full pairwise summation.
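A scalar stand-in for that idea, using 8 independent accumulators in place of SIMD lanes (the function name and the count of 8 are illustrative; a real SIMD version would keep each accumulator in a __m256):

```c
/* Sum an array with 8 independent partial sums.  Each accumulator
   only ever sees 1/8 of the data, so each running total stays smaller
   and loses fewer low-order bits than one big sequential total. */
float sum8_acc(const float *a, int n) {
    float acc[8] = {0};
    int i = 0;
    for (; i + 8 <= n; i += 8)
        for (int lane = 0; lane < 8; lane++)
            acc[lane] += a[i + lane];   /* independent partial sums */

    /* final reduction of the 8 accumulators */
    float s = 0.0f;
    for (int lane = 0; lane < 8; lane++) s += acc[lane];

    for (; i < n; i++) s += a[i];       /* leftover tail elements */
    return s;
}
```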


Footnote 1: Associativity is necessary for parallelization, SIMD, and multiple accumulators. See Associativity gives us parallelizability. But what does commutativity give?

On a machine with, for example, 4-cycle-latency, 2-per-clock-throughput FMA and a SIMD width of 8 floats (i.e. a Skylake system with AVX2), the potential speedup is 4*2 = 8 from multiple accumulators, times 8 from SIMD width, times the number of cores, vs. a purely sequential version, even for problems where the result might be less accurate instead of just different.

Most people consider a factor of 8*8 = 64 worth it! (And you can in theory also parallelize for another factor of maybe 4 on a quad-core, assuming perfect scaling for large matrices).

You're already using float instead of double for performance.

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about using multiple accumulators to hide FMA latency in a reduction, exposing that other factor of 8 speedup.

Also, do not use hadd inside an inner-most loop. Sum vertically, and use an efficient reduction at the end of the loop. (See Fastest way to do horizontal float vector sum on x86.) You really want to avoid having the compiler extract your vectors to scalar at every step; that defeats most of the benefit of SIMD! And hadd isn't even worth using for the horizontal sum of a single vector: it costs 2 shuffles + a regular add on all existing CPUs.
