Matrix-vector-multiplication in AVX not proportionately faster than in SSE


Question

I was writing a matrix-vector-multiplication in both SSE and AVX using the following:

// C = A * X, with A an M x N row-major matrix, X a vector of length N
// (N a multiple of 4); intrinsics come from &lt;immintrin.h&gt;
for(size_t i=0; i&lt;M; i++) {
    size_t index = i*N;                  // start of row i in row-major A
    __m128 a, x, r1;
    __m128 sum = _mm_setzero_ps();
    for(size_t j=0; j&lt;N; j+=4, index+=4) {
        a = _mm_load_ps(&A[index]);      // 4 floats from row i
        x = _mm_load_ps(&X[j]);          // matching 4 floats of X
        r1 = _mm_mul_ps(a, x);
        sum = _mm_add_ps(r1, sum);       // vertical accumulate
    }
    sum = _mm_hadd_ps(sum, sum);         // horizontal sum of the four
    sum = _mm_hadd_ps(sum, sum);         // partial sums (SSE3)
    _mm_store_ss(&C[i], sum);            // write the dot product C[i]
}

I used a similar method for AVX, however at the end, since AVX doesn't have an equivalent instruction to _mm_store_ss(), I used:

_mm_store_ss(&C[i],_mm256_castps256_ps128(sum));

The SSE code gives me a speedup of 3.7 over the serial code. However, the AVX code gives me a speedup of only 4.3 over the serial code.

I know that mixing SSE with AVX can cause problems, but I compiled it with the -mavx flag using g++, which should remove the SSE opcodes.

I could have also used: _mm256_storeu_ps(&C[i],sum) to do the same thing, but the speedup is the same.

Any insights as to what else I could be doing to improve performance? Could it be related to performance_memory_bound, though I didn't understand the answer on that thread clearly.

Also, I am not able to use the _mm_fmadd_ps() instruction even by including "immintrin.h" header file. I have both FMA and AVX enabled.
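(As an aside on the FMA point: with g++ the FMA intrinsics are typically only usable when FMA code generation is actually enabled on the command line, not just by including the header. A plausible compile line, with the file name hypothetical:)

```shell
# -mfma (or -march=native on an FMA-capable machine) exposes _mm_fmadd_ps
g++ -O3 -mavx -mfma matvec.cpp -o matvec
```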

Answer

I suggest you reconsider your algorithm. See the discussion Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

You're doing one long dot product and using _mm_hadd_ps per iteration. Instead you should do four dot products at once with SSE (eight with AVX) and only use vertical operators.

You need addition, multiplication, and a broadcast. This can all be done in SSE with _mm_add_ps, _mm_mul_ps, and _mm_shuffle_ps (for the broadcast).
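A minimal sketch of this vertical approach, computing four dot products at once. It assumes the transpose of A is available as AT with the layout AT[j*M + i] == A[i*N + j], that M is a multiple of 4, and uses _mm_set1_ps for the broadcast (the function name and layout are assumptions, not the asker's code):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Hedged sketch: four running dot products in one register, using only
// vertical multiply/add plus a broadcast of each X[j]; no horizontal ops
// until the final 4-wide store.
void matvec_sse_vertical(const float* AT, const float* X, float* C,
                         std::size_t M, std::size_t N) {
    for (std::size_t i = 0; i < M; i += 4) {
        __m128 acc = _mm_setzero_ps();                   // C[i..i+3]
        for (std::size_t j = 0; j < N; j++) {
            __m128 xj  = _mm_set1_ps(X[j]);              // broadcast X[j]
            __m128 col = _mm_loadu_ps(&AT[j*M + i]);     // A[i..i+3][j]
            acc = _mm_add_ps(acc, _mm_mul_ps(col, xj));  // vertical accumulate
        }
        _mm_storeu_ps(&C[i], acc);                       // four results at once
    }
}
```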

If you already have the transpose of the matrix this is really simple.

But whether you have the transpose or not, you need to make your code more cache friendly. To fix this I suggest loop tiling of the matrix. See this discussion What is the fastest way to transpose a matrix in C++? to get an idea of how to do loop tiling.
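For the matrix-vector case, one way to tile is over the columns, so the corresponding slice of X stays hot in cache while sweeping down all the rows. A scalar sketch (the tile size and function name are illustrative, not from the original answer):

```cpp
#include <cstddef>

// Hedged sketch of loop tiling: walk the columns in blocks of kTile and
// accumulate partial sums into C, so each X[jj..jend) slice is reused
// across all M rows before moving to the next block.
constexpr std::size_t kTile = 256;

void matvec_tiled(const float* A, const float* X, float* C,
                  std::size_t M, std::size_t N) {
    for (std::size_t i = 0; i < M; i++) C[i] = 0.0f;
    for (std::size_t jj = 0; jj < N; jj += kTile) {           // column tile
        const std::size_t jend = (jj + kTile < N) ? jj + kTile : N;
        for (std::size_t i = 0; i < M; i++) {                 // all rows
            float sum = C[i];
            for (std::size_t j = jj; j < jend; j++)           // inside tile
                sum += A[i*N + j] * X[j];
            C[i] = sum;
        }
    }
}
```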

I would try to get the loop tiling right first, before even trying SSE/AVX. The biggest boost I got in my matrix multiplication was not from SIMD or threading; it was from loop tiling. I think if you get the cache usage right, your AVX code will also scale more linearly relative to SSE.
