What are the benefits of using vaddss instead of addss in scalar matrix addition?


Question



I have implemented a scalar matrix addition kernel.

#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>

//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000


float   __attribute__(( aligned(32))) A[N][M],
        __attribute__(( aligned(32))) B[N][M],
        __attribute__(( aligned(32))) C[N][M];

int main()
{
int w=0, i, j;
struct timespec tStart, tEnd;//used to record the processing time
double tTotal , tBest=10000;//the minimum total time will be assigned as the best time
do{
    clock_gettime(CLOCK_MONOTONIC,&tStart);

    for( i=0;i<N;i++){
        for(j=0;j<M;j++){
            C[i][j]= A[i][j] + B[i][j];
        }
    }

    clock_gettime(CLOCK_MONOTONIC,&tEnd);
    tTotal = (tEnd.tv_sec - tStart.tv_sec);
    tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
    if(tTotal<tBest)
        tBest=tTotal;
    } while(w++ < NUM_LOOP);

printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);
return 0;
}

In this case, I've compiled the program with different compiler flags, and the assembly output of the inner loop is as follows:

gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix

movss   xmm1, DWORD PTR A[rcx+rax]
addss   xmm1, DWORD PTR B[rcx+rax]
movss   DWORD PTR C[rcx+rax], xmm1

gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix

vmovss  xmm1, DWORD PTR A[rcx+rax]
vaddss  xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss  DWORD PTR C[rcx+rax], xmm1

AVX version gcc -O2 -mavx:

__m256 vec256;
for(i=0;i<N;i++){
    for(j=0;j<M;j+=8){
        vec256 = _mm256_add_ps( _mm256_load_ps(&A[i][j]), _mm256_load_ps(&B[i][j]) );
        _mm256_store_ps(&C[i][j], vec256);
    }
}

SSE version gcc -O2 -msse4.2:

__m128 vec128;
for(i=0;i<N;i++){
    for(j=0;j<M;j+=4){
        vec128 = _mm_add_ps( _mm_load_ps(&A[i][j]), _mm_load_ps(&B[i][j]) );
        _mm_store_ps(&C[i][j], vec128);
    }
}

In the scalar program the speedup of -mavx over -msse4.2 is 2.7x. I know AVX improved the ISA, and the speedup might be due to those improvements. But when I implemented the program with intrinsics, for both AVX and SSE, the speedup is a factor of 3x. The question is: scalar AVX is 2.7x faster than scalar SSE, yet when I vectorize the code the speedup is only 3x (the matrix size is 128x128 for this question). Does that make sense? If using AVX versus SSE in scalar mode already yields a 2.7x speedup, the vectorized method should do much better, because with AVX I process eight elements per instruction compared to four with SSE. All programs have less than 4.5% cache misses, as perf stat reported.

Using gcc -O2, Linux Mint, Skylake.

UPDATE: Briefly, scalar AVX is 2.7x faster than scalar SSE, but AVX-256 is only 3x faster than SSE-128 when the code is vectorized. I think it might be because of pipelining: in scalar code I have 3 vector ALUs that might not be usable in the same way in vectorized mode. I might be comparing apples to oranges instead of apples to apples, and that might be why I cannot understand the result.

Solution

The problem you are observing is explained here. On Skylake systems, if the upper half of an AVX register is dirty, then non-VEX-encoded SSE operations have a false dependency on the upper half of that register. In your case it seems there is a bug in your version of glibc (2.23). On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use

__asm__ __volatile__ ( "vzeroupper" : : : ); 

to clean the upper halves of the AVX registers. I don't think you can use an intrinsic such as _mm256_zeroupper to fix this, because GCC will say it's SSE code and not recognize the intrinsic. The option -mvzeroupper won't work either, because GCC once again thinks it's SSE code and will not emit the vzeroupper instruction.
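
For instance, a minimal sketch of where this workaround could sit in the benchmark from the question (assuming the rest of the code stays unchanged): issue the vzeroupper right after clock_gettime, one of the calls after which dirty upper halves have been observed, and before the timed scalar loop, so the non-VEX movss/addss instructions no longer carry the false dependency.

do{
    clock_gettime(CLOCK_MONOTONIC,&tStart);

    /* clock_gettime (like printf or memset) may leave the upper halves of the
       YMM registers dirty, so clear them before the non-VEX scalar loop runs */
    __asm__ __volatile__ ( "vzeroupper" : : : );

    for( i=0;i<N;i++){
        for(j=0;j<M;j++){
            C[i][j]= A[i][j] + B[i][j];
        }
    }

    clock_gettime(CLOCK_MONOTONIC,&tEnd);
    /* ... timing bookkeeping exactly as in the original benchmark ... */
} while(w++ < NUM_LOOP);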

BTW, it's Microsoft's fault that the hardware has this problem.


Update:

Other people are apparently encountering this problem on Skylake. It has been observed after printf, memset, and clock_gettime.

If your goal is to compare 128-bit operations with 256-bit operations, you could consider using -mprefer-avx128 -mavx (which is particularly useful on AMD). But then you would be comparing AVX256 vs AVX128 and not AVX256 vs SSE. AVX128 and SSE both use 128-bit operations, but their implementations are different. If you benchmark, you should mention which one you used.
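
One way to get the AVX128 side of that comparison, assuming the intrinsics code from the question, is simply to compile the existing __m128/_mm_add_ps kernel with -mavx instead of -msse4.2: GCC then emits VEX-encoded 128-bit instructions (vaddps xmm, ...), which can be benchmarked against the 256-bit build made with the same flag.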
