Matrix-vector-multiplication in AVX not proportionately faster than in SSE


Question

I was writing a matrix-vector-multiplication in both SSE and AVX using the following:

// C = A * X, with A an M x N row-major matrix, X a vector of length N
// (N a multiple of 4); intrinsics come from &lt;immintrin.h&gt;
for(size_t i=0; i&lt;M; i++) {
    size_t index = i*N;                  // start of row i in row-major A
    __m128 a, x, r1;
    __m128 sum = _mm_setzero_ps();
    for(size_t j=0; j&lt;N; j+=4, index+=4) {
        a = _mm_load_ps(&A[index]);      // 4 floats from row i
        x = _mm_load_ps(&X[j]);          // matching 4 floats of X
        r1 = _mm_mul_ps(a, x);
        sum = _mm_add_ps(r1, sum);       // vertical accumulate
    }
    sum = _mm_hadd_ps(sum, sum);         // horizontal sum of the four
    sum = _mm_hadd_ps(sum, sum);         // partial sums (SSE3)
    _mm_store_ss(&C[i], sum);            // write the dot product C[i]
}

I used a similar method for AVX, however at the end, since AVX doesn't have an equivalent instruction to _mm_store_ss(), I used:

_mm_store_ss(&C[i],_mm256_castps256_ps128(sum));

The SSE code gives me a speedup of 3.7 over the serial code. However, the AVX code gives me a speedup of only 4.3 over the serial code.

I know that mixing SSE with AVX can cause problems, but I compiled it with the -mavx flag using g++, which should remove the SSE opcodes.

I could have also used: _mm256_storeu_ps(&C[i],sum) to do the same thing, but the speedup is the same.

Any insights as to what else I could be doing to improve performance? Could it be related to performance_memory_bound, though I didn't understand the answer on that thread clearly.

Also, I am not able to use the _mm_fmadd_ps() instruction even by including "immintrin.h" header file. I have both FMA and AVX enabled.
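(As an aside on the FMA point: with g++ the FMA intrinsics are typically only usable when FMA code generation is actually enabled on the command line, not just by including the header. A plausible compile line, with the file name hypothetical:)

```shell
# -mfma (or -march=native on an FMA-capable machine) exposes _mm_fmadd_ps
g++ -O3 -mavx -mfma matvec.cpp -o matvec
```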

Answer

I suggest you reconsider your algorithm. See the discussion Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

You're doing one long dot product and using _mm_hadd_ps per iteration. Instead you should do four dot products at once with SSE (eight with AVX) and only use vertical operators.

You need addition, multiplication, and a broadcast. This can all be done in SSE with _mm_add_ps, _mm_mul_ps, and _mm_shuffle_ps (for the broadcast).
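A minimal sketch of this vertical approach, computing four dot products at once. It assumes the transpose of A is available as AT with the layout AT[j*M + i] == A[i*N + j], that M is a multiple of 4, and uses _mm_set1_ps for the broadcast (the function name and layout are assumptions, not the asker's code):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Hedged sketch: four running dot products in one register, using only
// vertical multiply/add plus a broadcast of each X[j]; no horizontal ops
// until the final 4-wide store.
void matvec_sse_vertical(const float* AT, const float* X, float* C,
                         std::size_t M, std::size_t N) {
    for (std::size_t i = 0; i < M; i += 4) {
        __m128 acc = _mm_setzero_ps();                   // C[i..i+3]
        for (std::size_t j = 0; j < N; j++) {
            __m128 xj  = _mm_set1_ps(X[j]);              // broadcast X[j]
            __m128 col = _mm_loadu_ps(&AT[j*M + i]);     // A[i..i+3][j]
            acc = _mm_add_ps(acc, _mm_mul_ps(col, xj));  // vertical accumulate
        }
        _mm_storeu_ps(&C[i], acc);                       // four results at once
    }
}
```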

If you already have the transpose of the matrix this is really simple.

But whether you have the transpose or not, you need to make your code more cache friendly. To fix this I suggest loop tiling of the matrix. See this discussion What is the fastest way to transpose a matrix in C++? to get an idea of how to do loop tiling.
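For the matrix-vector case, one way to tile is over the columns, so the corresponding slice of X stays hot in cache while sweeping down all the rows. A scalar sketch (the tile size and function name are illustrative, not from the original answer):

```cpp
#include <cstddef>

// Hedged sketch of loop tiling: walk the columns in blocks of kTile and
// accumulate partial sums into C, so each X[jj..jend) slice is reused
// across all M rows before moving to the next block.
constexpr std::size_t kTile = 256;

void matvec_tiled(const float* A, const float* X, float* C,
                  std::size_t M, std::size_t N) {
    for (std::size_t i = 0; i < M; i++) C[i] = 0.0f;
    for (std::size_t jj = 0; jj < N; jj += kTile) {           // column tile
        const std::size_t jend = (jj + kTile < N) ? jj + kTile : N;
        for (std::size_t i = 0; i < M; i++) {                 // all rows
            float sum = C[i];
            for (std::size_t j = jj; j < jend; j++)           // inside tile
                sum += A[i*N + j] * X[j];
            C[i] = sum;
        }
    }
}
```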

I would try to get the loop tiling right first, before even trying SSE/AVX. The biggest boost I got in my matrix multiplication was not from SIMD or threading; it was from loop tiling. I think if you get the cache usage right, your AVX code will also scale more linearly relative to SSE.
