Fastest way to do horizontal vector sum with AVX instructions

Views: 34 Published: 2022/1/6 12:54:15 Tags: x86 sse simd avx vector-processing


Question

I have a packed vector of four 64-bit floating-point values.
I would like to get the sum of the vector's elements.

With SSE (and 32-bit floats) I could just do the following:

v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);
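
For reference, the two-step SSE pattern above can be wrapped in a small function. This is only a sketch: the name hsum4_ps is hypothetical, and the target attribute assumes GCC or Clang so that the SSE3 intrinsic compiles without extra flags.

```c
#include <immintrin.h>

/* Hypothetical helper: sums the four floats stored at p.
   The target attribute is a GCC/Clang assumption; drop it if you
   build with -msse3 or higher. */
__attribute__((target("sse3")))
float hsum4_ps(const float *p)
{
    __m128 v = _mm_loadu_ps(p);  /* v = {a, b, c, d}            */
    v = _mm_hadd_ps(v, v);       /* {a+b, c+d, a+b, c+d}        */
    v = _mm_hadd_ps(v, v);       /* every element is a+b+c+d    */
    return _mm_cvtss_f32(v);     /* read the lowest element     */
}
```

Calling hsum4_ps on an array holding 1, 2, 3, 4 should return 10.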

Unfortunately, although AVX provides the _mm256_hadd_pd intrinsic, its result differs from the SSE version. I believe this is because most AVX instructions operate on the low and high 128-bit halves separately, like two independent SSE instructions, without ever crossing the 128-bit boundary.
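
A small sketch may make the in-lane behaviour concrete. It applies _mm256_hadd_pd to a vector paired with itself and stores the result so it can be inspected. The function name hadd_pd_self is hypothetical, and the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helper: writes _mm256_hadd_pd(v, v) to out, where v is
   loaded from the four doubles at in. The target attribute is a
   GCC/Clang assumption; drop it if you build with -mavx. */
__attribute__((target("avx")))
void hadd_pd_self(const double *in, double *out)
{
    __m256d v = _mm256_loadu_pd(in);   /* {a, b | c, d}, "|" marks the 128-bit boundary */
    __m256d h = _mm256_hadd_pd(v, v);  /* {a+b, a+b | c+d, c+d}                         */
    _mm256_storeu_pd(out, h);
}
```

For the input 1, 2, 3, 4 the output is 3, 3, 7, 7: each 128-bit half is summed on its own, so no element holds the full total and repeating the hadd would not combine the two halves.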

Ideally, the solution I am looking for should follow these guidelines:
1) Use only AVX/AVX2 instructions (no SSE).
2) Take no more than 2-3 instructions.

However, any efficient or elegant way to do it (even one that does not follow the guidelines above) is welcome.

Thanks a lot for your help.

- Luigi Castelli

Recommended answer

If you have two __m256d vectors x1 and x2, each containing four doubles that you want to horizontally sum, you could do:

__m256d x1, x2;
// calculate 4 two-element horizontal sums:
// lower 64 bits contain x1[0] + x1[1]
// next 64 bits contain x2[0] + x2[1]
// next 64 bits contain x1[2] + x1[3]
// next 64 bits contain x2[2] + x2[3]
__m256d sum = _mm256_hadd_pd(x1, x2);
// extract upper 128 bits of result
__m128d sum_high = _mm256_extractf128_pd(sum, 1);
// add upper 128 bits of sum to its lower 128 bits
__m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
// lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
// upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]

So 3 instructions perform 2 of the horizontal sums that you need. The above is untested, but you should get the concept.
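
For the single-vector case in the question, the same three steps can be applied by passing the vector as both operands of the horizontal add. The following is a minimal sketch of that idea, not a benchmarked "fastest" version. The name hsum4_pd is hypothetical, and the target attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helper: sums the four doubles stored at p.
   The target attribute is a GCC/Clang assumption; drop it if you
   build with -mavx. */
__attribute__((target("avx")))
double hsum4_pd(const double *p)
{
    __m256d x = _mm256_loadu_pd(p);                   /* {a, b | c, d}         */
    __m256d sum = _mm256_hadd_pd(x, x);               /* {a+b, a+b | c+d, c+d} */
    __m128d sum_high = _mm256_extractf128_pd(sum, 1); /* {c+d, c+d}            */
    __m128d result =
        _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum)); /* both elements: a+b+c+d */
    return _mm_cvtsd_f64(result);                     /* read the low element  */
}
```

Calling hsum4_pd on an array holding 1, 2, 3, 4 should return 10. The 128-bit _mm_add_pd is written as an SSE2 intrinsic, but when AVX code generation is enabled compilers emit its VEX-encoded form, so it does not mix legacy SSE instructions into the AVX code.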
