Simd matmul program gives different numerical results

Problem description

I am trying to program matrix multiplication in C using SIMD intrinsics. I was pretty sure of my implementation, but when I execute it, I get some numerical errors starting from the 5th digit of the resulting matrix's coefficients.

REAL_T is just a typedef for float.

/* This is my matmul version with SIMD, using single-precision floats */
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  __m256 vA, vB, vC, vRes;
  for (i=0; i<n; i++){
    for (j=0; j<n; j++){  
      for (k=0; k<n; k= k+8){
        vA = _mm256_load_ps(&A[i*n+k]);
        vB = _mm256_loadu_ps(&B[k*n+j]);
        vC = _mm256_mul_ps(vA, vB);
        vC = _mm256_hadd_ps(vC, vC);
        vC = _mm256_hadd_ps(vC, vC);
        /*To get the resulting coefficient, after doing 2 hadds,
        I have to get the first and the last element of the resulting
        Vector vC*/
        C[i*n+j] += ((float )(vC[0])) + ((float )(vC[7]));
      } /* for k */
    } /* for j */
  } /* for i */
}
/* End of program */

/* And this is the sequential version */
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  for (i=0; i<n; i++){ 
    for (j=0; j<n; j++){
      for (k=0; k<n; k++){
        C[i*n+j] +=  A[i*n+k] *  B[k*n+j];  
      } /* for k */
    } /* for j */
  } /* for i */  
}
/*End of program*/

/* The matrices are initialized as follows */
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++){
      *(A+i*n+j) = 1 / ((REAL_T) (i+j+1));
      *(B+i*n+j) = 1.0;
      *(C+i*n+j) = 1.0;
    }
/*End of initialization*/

The tested matrices are of size 512*512. For the sequential version, the top-left square of the resulting matrix gives:

+6.916512e+01  +6.916512e+01  
+5.918460e+01  +5.918460e+01  

+7.946186e+00  +7.946186e+00  
+7.936391e+00  +7.936391e+00  

However, for the SIMD version, the square is:

+6.916510e+01  +6.916510e+01  
+5.918463e+01  +5.918463e+01  

+7.946147e+00  +7.946147e+00  
+7.936355e+00  +7.936355e+00 

As shown, there is a numerical difference between the two versions. Any help would be really appreciated!

Recommended answer

This looks normal; adding numbers in a different order produces different rounding in the temporaries.

FP math is not associative; optimizing as if it were changes results.¹ See Is floating point addition and multiplication associative? / Are floating point operations in C associative?

The amount of change depends on the data. Differences only in the 5th decimal place seem reasonable for float.
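As a quick illustration (a hypothetical standalone demo, not code from the question), grouping the same three float values differently already changes the result:

#include <stdio.h>

int main(void){
  float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
  float left  = (a + b) + c;  /* a + b is exactly 0, so this is 1.0f          */
  float right = a + (b + c);  /* b + c rounds back to -1.0e8f, so this is 0.0 */
  printf("(a+b)+c = %g\na+(b+c) = %g\n", left, right);
  return 0;
}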

Unless you take special numerical precautions, like adding up the small numbers first, the sequential-order results aren't more "correct"; they just have different errors.

In fact, using multiple accumulators generally increases precision for large lists, assuming your numbers all have similar magnitude. (Ideally multiple SIMD vectors, each composed of multiple elements, to hide FP-add or FMA latency.) https://en.wikipedia.org/wiki/Pairwise_summation is a numerical technique that takes this to the next level: summing subsets of the list in a tree, to avoid adding single array elements to a much larger value. See, for example, How to avoid less precise sum for numpy-arrays with multiple columns.

Using a fixed number of accumulators (e.g. 8x __m256 = 64 float accumulators) might reduce expected error by a factor of 64, instead of from N to log N for full pairwise summation.
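For instance, here is a minimal scalar sketch of that idea (my illustration, not from the answer; sum_1acc and sum_8acc are made-up names, and n is assumed to be a multiple of 8). With elements of similar magnitude, each of the eight partial sums stays smaller, so each individual addition loses less precision:

/* One accumulator vs. eight interleaved accumulators over the same data. */
float sum_1acc(const float *x, int n){
  float s = 0.0f;
  for (int i = 0; i < n; i++) s += x[i];
  return s;
}

float sum_8acc(const float *x, int n){
  float acc[8] = {0};
  for (int i = 0; i < n; i += 8)
    for (int j = 0; j < 8; j++)
      acc[j] += x[i + j];            /* 8 independent partial sums */
  float s = 0.0f;
  for (int j = 0; j < 8; j++)        /* final reduction of the partials */
    s += acc[j];
  return s;
}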

Footnote 1: Associativity is needed for parallelization, SIMD, and multiple accumulators. (Associativity gives us parallelizability. But what does commutativity give?)

On a machine with, for example, 4-cycle-latency, 2-per-clock-throughput FMA and a SIMD width of 8 floats (i.e. a Skylake system with AVX2), the potential speedup is 4*2 = 8 from multiple accumulators, times 8 from SIMD width, times the number of cores, vs. a pure sequential version, even for problems where it might be less accurate instead of just different.

Most people consider a factor of 8*8 = 64 worth it! (And you can in theory also parallelize for another factor of maybe 4 on a quad-core, assuming perfect scaling for large matrices).

You're already using float instead of double for performance.

See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about using multiple accumulators to hide FMA latency in a reduction, exposing that other factor of 8 speedup.
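As a sketch of that technique (my own example, not code from the linked answer; it assumes AVX and FMA are available, e.g. compiled with -mavx -mfma, and that n is a multiple of 32), a dot-product reduction can keep four independent __m256 accumulators in flight, so consecutive FMAs do not wait on each other's latency:

#include <immintrin.h>

float dot(const float *a, const float *b, int n){
  __m256 acc0 = _mm256_setzero_ps();
  __m256 acc1 = _mm256_setzero_ps();
  __m256 acc2 = _mm256_setzero_ps();
  __m256 acc3 = _mm256_setzero_ps();
  for (int i = 0; i < n; i += 32){
    /* 4 independent dependency chains, 8 floats each */
    acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i),    _mm256_loadu_ps(b+i),    acc0);
    acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+8),  _mm256_loadu_ps(b+i+8),  acc1);
    acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+16), _mm256_loadu_ps(b+i+16), acc2);
    acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a+i+24), _mm256_loadu_ps(b+i+24), acc3);
  }
  /* combine the 4 accumulators, then do one horizontal sum at the very end */
  __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
  __m128 lo = _mm256_castps256_ps128(acc);
  __m128 hi = _mm256_extractf128_ps(acc, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
  return _mm_cvtss_f32(s);
}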

Also, do not use hadd inside an inner-most loop. Sum vertically, and use an efficient reduction at the end of the loop. (Fastest way to do horizontal float vector sum on x86.) You really, really want to avoid having the compiler extract your vectors to scalar at every step; that defeats most of the benefit of SIMD! Besides, hadd is not worth using for horizontal sums of 1 vector; it costs 2 shuffles + a regular add on all existing CPUs.
