How to calculate a single-vector dot product using SSE intrinsic functions in C
Question
I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain one number. For instance, for the vectors {1,2,3,4} and {5,6,7,8} the calculation would look like this:
1*5 + 2*6 + 3*7 + 4*8
Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but as far as I can tell it doesn't have an intrinsic function associated with it. At this point I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.
Note: I am optimizing for a specific microarchitecture which supports up to SSE 4.2.
Recommended answer
If you're doing a dot product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
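As a rough illustration of that advice, here is a minimal sketch for two float arrays. The function name dot_product_sse and its pointer-plus-length interface are hypothetical, not something from the question. The sketch assumes nothing about alignment, so it uses unaligned loads, and it handles the last n % 4 elements with a scalar loop. The horizontal sum at the end uses only SSE1 shuffles.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, ... */

/* Hypothetical helper: dot product of two float arrays of length n. */
float dot_product_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* four running partial sums */
    size_t i = 0;

    /* Vertical multiply and add only; no horizontal work inside the loop. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load: no alignment assumed */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* One horizontal sum of the four partial sums, done once at the end. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    float result = _mm_cvtss_f32(sums);

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i)
        result += a[i] * b[i];

    return result;
}
```

Because the partial sums are accumulated per lane, floating-point rounding can differ slightly from a plain scalar loop. With several independent accumulators the loop may also hide more of the add latency, but that is a tuning choice this sketch leaves out.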
But if you are doing a dot product of just a single pair of SIMD vectors:
GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single and double-precision dot products:
__m128 _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
__m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);
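A minimal sketch of how _mm_dp_ps might be called for one pair of 4-float vectors follows. The helper name dot4_dpps and its pointer interface are hypothetical. The third argument is an immediate mask: bits 7:4 select which lane products are summed, and bits 3:0 select which result lanes receive the sum (the others are zeroed). The target attribute is a GCC/Clang extension used here only so the function builds without passing -msse4.1 for the whole file; that choice is an assumption of this sketch, and the code still needs a CPU with SSE4.1 to run.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical helper: dot product of two 4-float vectors using dpps.
 * Assumes GCC or Clang; with other compilers, drop the attribute and
 * enable SSE4.1 through the compiler's own options instead. */
__attribute__((target("sse4.1")))
float dot4_dpps(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);  /* unaligned load: no alignment assumed */
    __m128 vb = _mm_loadu_ps(b);

    /* Mask 0xF1: high nibble 0xF multiplies and sums all four lanes,
     * low nibble 0x1 writes the sum to lane 0 only. */
    __m128 dp = _mm_dp_ps(va, vb, 0xF1);

    return _mm_cvtss_f32(dp);     /* extract lane 0 */
}
```

For the vectors from the question, dot4_dpps would be called with pointers to {1,2,3,4} and {5,6,7,8}.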
On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.
But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables.)
As a fallback for older processors, you can use this algorithm to compute the dot product of the vectors a and b:
__m128 r1 = _mm_mul_ps(a, b);
and then horizontally sum r1 using the approach from "Fastest way to do horizontal float vector sum on x86" (see that answer for a commented version of this code and an explanation of why it's faster):
__m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums = _mm_add_ps(r1, shuf);
shuf = _mm_movehl_ps(shuf, sums);
sums = _mm_add_ss(sums, shuf);
float result = _mm_cvtss_f32(sums);
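To make the shuffle steps easier to follow, here is the same fallback wrapped in a function, with the lane contents traced in comments for the vectors from the question. The name dot4_sse and the pointer interface are hypothetical; lanes are written from lane 0 to lane 3. It is only a sketch of the sequence above, using SSE1 intrinsics.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only */

/* Hypothetical helper: dot product of two 4-float vectors without dpps.
 * Comments trace a = {1,2,3,4}, b = {5,6,7,8}, lanes listed 0..3. */
float dot4_sse(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 r1 = _mm_mul_ps(va, vb);           /* [ 5, 12, 21, 32] */

    /* Swap neighbours within each pair of lanes. */
    __m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
                                              /* [12,  5, 32, 21] */
    __m128 sums = _mm_add_ps(r1, shuf);       /* [17, 17, 53, 53] */

    /* Move the high pair of sums into the low half of shuf. */
    shuf = _mm_movehl_ps(shuf, sums);         /* [53, 53, 32, 21] */

    /* Add lane 0 only; the other lanes of sums pass through unchanged. */
    sums = _mm_add_ss(sums, shuf);            /* [70, 17, 53, 53] */

    return _mm_cvtss_f32(sums);               /* lane 0: 70 */
}
```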
A slower alternative uses _mm_hadd_ps (an SSE3 instruction). It costs 2 shuffles per hadd, which will easily bottleneck on shuffle throughput, especially on Intel CPUs:
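/* Alternative to the shuffle-based sum; _mm_hadd_ps needs <pmmintrin.h> */
__m128 r2 = _mm_hadd_ps(r1, r1);
__m128 r3 = _mm_hadd_ps(r2, r2);
float result;
_mm_store_ss(&result, r3);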