How to Calculate single-vector Dot Product using SSE intrinsic functions in C

Views: 211  Published: 2020/5/21 20:28:15  Tags: c optimization vectorization sse simd


Question

I am trying to multiply two vectors element-wise, so that each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the elements of the resulting vector to obtain a single number. For instance, for the vectors {1,2,3,4} and {5,6,7,8} the calculation would look like this:

1*5 + 2*6 + 3*7 + 4*8

Essentially, I am taking the dot product of the two vectors. I know there is an SSE instruction to do this, but the instruction doesn't have an intrinsic function associated with it. At this point, I don't want to write inline assembly in my C code, so I want to use only intrinsic functions. This seems like a common calculation, so I am surprised that I couldn't find the answer on Google.

Note: I am optimizing for a specific microarchitecture which supports up to SSE 4.2.

Recommended answer

If you're doing a dot product of longer vectors, use multiply and regular _mm_add_ps (or FMA) inside the inner loop. Save the horizontal sum until the end.
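As a rough illustration of that advice, here is a minimal sketch that keeps four partial sums in one vector accumulator and does a single horizontal sum after the loop. The function name dot_product_sse and its signature are assumptions made for this example, not part of the original answer; it uses unaligned loads and a scalar tail so that any length n works, and it needs only SSE1.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, ... */

/* Hypothetical helper: dot product of two float arrays of length n. */
float dot_product_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();           /* four running partial sums */
    size_t i = 0;

    /* Vertical multiply and add only; no horizontal work inside the loop. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);     /* unaligned loads */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* One horizontal sum of the accumulator, after the loop. */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(acc, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    float result = _mm_cvtss_f32(sums);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        result += a[i] * b[i];
    return result;
}
```

Because the partial sums are combined in a different order than a plain scalar loop would use, results for general floating-point data can differ slightly in rounding. For long arrays, several independent accumulators usually help hide the latency of the add, which this sketch leaves out for brevity.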

But if you are doing a dot product of just a single pair of SIMD vectors:

GCC (at least version 4.3) includes <smmintrin.h> with SSE4.1 level intrinsics, including the single- and double-precision dot products:

__m128 _mm_dp_ps (__m128 __X, __m128 __Y, const int __M);
__m128d _mm_dp_pd (__m128d __X, __m128d __Y, const int __M);

On Intel mainstream CPUs (not Atom/Silvermont) these are somewhat faster than doing it manually with multiple instructions.

But on AMD (including Ryzen), dpps is significantly slower. (See Agner Fog's instruction tables.)
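For reference, a minimal sketch of calling _mm_dp_ps on one pair of vectors. The wrapper name dot4_dpps is an assumption for illustration. The third argument is an immediate mask: its high four bits select which element products go into the sum, and its low four bits select which result lanes receive that sum (the others are zeroed), so 0xF1 means "multiply all four lanes, write the total to lane 0". The code must be built with SSE4.1 enabled, for example with -msse4.1; the per-function target attribute used here is a GCC/Clang-specific alternative.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Hypothetical wrapper: dot product of two 4-float vectors via dpps. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))   /* GCC/Clang: enable SSE4.1 for this function only */
#endif
float dot4_dpps(__m128 a, __m128 b)
{
    /* Mask 0xF1: high nibble 0xF includes all four products in the sum,
       low nibble 0x1 stores the sum in lane 0 and zeroes the other lanes. */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```

With the vectors from the question, dot4_dpps(_mm_setr_ps(1, 2, 3, 4), _mm_setr_ps(5, 6, 7, 8)) evaluates 1*5 + 2*6 + 3*7 + 4*8.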

As a fallback for older processors, you can use this algorithm to compute the dot product of the vectors a and b:

__m128 r1 = _mm_mul_ps(a, b);

and then horizontally sum r1 using the method from the Stack Overflow answer "Fastest way to do horizontal float vector sum on x86" (see there for a commented version of this, and why it's faster):

__m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
__m128 sums = _mm_add_ps(r1, shuf);
shuf = _mm_movehl_ps(shuf, sums);
sums = _mm_add_ss(sums, shuf);
float result = _mm_cvtss_f32(sums);
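Put together, the multiply and the shuffle-based horizontal sum could be wrapped as follows. This is only a sketch: the function name dot4_sse and the choice to load from plain float arrays with unaligned loads are assumptions for the example. The comments trace which elements sit in which lane, writing the four products as p0..p3 from lane 0 upward.

```c
#include <xmmintrin.h>  /* SSE1 is enough for every intrinsic used here */

/* Hypothetical helper: dot product of two arrays of exactly 4 floats. */
float dot4_sse(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 r1 = _mm_mul_ps(va, vb);                 /* [p0, p1, p2, p3] */

    /* Swap neighbours within each pair: [p1, p0, p3, p2] */
    __m128 shuf = _mm_shuffle_ps(r1, r1, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise sums: [p0+p1, p0+p1, p2+p3, p2+p3] */
    __m128 sums = _mm_add_ps(r1, shuf);
    /* Move the high half of sums into the low half: lane 0 = p2+p3 */
    shuf = _mm_movehl_ps(shuf, sums);
    /* Lane 0 = (p0+p1) + (p2+p3); the upper lanes are not used */
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

Called on the arrays {1,2,3,4} and {5,6,7,8}, it computes the sum shown in the question.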

A slower alternative uses _mm_hadd_ps. Each hadd costs 2 shuffles, which will easily bottleneck on shuffle throughput, especially on Intel CPUs.

__m128 r2 = _mm_hadd_ps(r1, r1);
__m128 r3 = _mm_hadd_ps(r2, r2);
float result;
_mm_store_ss(&result, r3);
