What is the fastest way to do a SIMD gather without AVX(2)?

Views: 46 Published: 2021/8/27 19:46:14 Tags: x86 sse simd sse4

Question

Assuming I have SSE to SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers):

a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3

into four vectors a, b, c, d?

a: {a0, a1, a2, a3}
b: {b0, b1, b2, b3}
c: {c0, c1, c2, c3}
d: {d0, d1, d2, d3}

I'm not sure whether this is relevant or not, but in my actual application I have 16 vectors and as such a0 and a1 are 16*4 bytes apart in memory.

Recommended answer

What you need here is 4 loads followed by a 4x4 transpose:

#include <emmintrin.h>                               // SSE2

// a points to 16-byte-aligned 32-bit integers. Rows are 16 elements
// (16*4 bytes) apart, as in the question's actual application.
__m128i v0 = _mm_load_si128((__m128i *)&a[0]);       // v0 = a0 b0 c0 d0
__m128i v1 = _mm_load_si128((__m128i *)&a[16]);      // v1 = a1 b1 c1 d1
__m128i v2 = _mm_load_si128((__m128i *)&a[32]);      // v2 = a2 b2 c2 d2
__m128i v3 = _mm_load_si128((__m128i *)&a[48]);      // v3 = a3 b3 c3 d3

// 4x4 transpose: interleave 32-bit lanes, then 64-bit halves

__m128i w0 = _mm_unpacklo_epi32(v0, v1);             // w0 = a0 a1 b0 b1
__m128i w1 = _mm_unpackhi_epi32(v0, v1);             // w1 = c0 c1 d0 d1
__m128i w2 = _mm_unpacklo_epi32(v2, v3);             // w2 = a2 a3 b2 b3
__m128i w3 = _mm_unpackhi_epi32(v2, v3);             // w3 = c2 c3 d2 d3
v0 = _mm_unpacklo_epi64(w0, w2);                     // v0 = a0 a1 a2 a3
v1 = _mm_unpackhi_epi64(w0, w2);                     // v1 = b0 b1 b2 b3
v2 = _mm_unpacklo_epi64(w1, w3);                     // v2 = c0 c1 c2 c3
v3 = _mm_unpackhi_epi64(w1, w3);                     // v3 = d0 d1 d2 d3
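
The fragment above can be wrapped into a self-contained function. The following is only a sketch under stated assumptions: the helper name gather4x4_epi32, its stride parameter, and the out array are hypothetical and do not appear in the original answer. It uses unaligned loads (_mm_loadu_si128) so it does not depend on the alignment of the input; if the data is known to be 16-byte aligned, _mm_load_si128 can be used as in the answer.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the original answer).
 * Reads four rows of four int32 values, with consecutive rows `stride`
 * elements apart, and writes the four columns to out[0..3]:
 *   out[0] = {a0,a1,a2,a3}, out[1] = {b0,b1,b2,b3}, and so on. */
static void gather4x4_epi32(const int32_t *src, size_t stride, __m128i out[4])
{
    assert(stride >= 4);  /* rows must not overlap */

    /* Four row loads; unaligned loads are an assumption of this sketch. */
    __m128i r0 = _mm_loadu_si128((const __m128i *)(src + 0 * stride));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(src + 1 * stride));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(src + 2 * stride));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(src + 3 * stride));

    /* Step 1: interleave 32-bit lanes of row pairs. */
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);   /* a0 a1 b0 b1 */
    __m128i t1 = _mm_unpackhi_epi32(r0, r1);   /* c0 c1 d0 d1 */
    __m128i t2 = _mm_unpacklo_epi32(r2, r3);   /* a2 a3 b2 b3 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);   /* c2 c3 d2 d3 */

    /* Step 2: combine 64-bit halves to finish the transpose. */
    out[0] = _mm_unpacklo_epi64(t0, t2);       /* a0 a1 a2 a3 */
    out[1] = _mm_unpackhi_epi64(t0, t2);       /* b0 b1 b2 b3 */
    out[2] = _mm_unpacklo_epi64(t1, t3);       /* c0 c1 c2 c3 */
    out[3] = _mm_unpackhi_epi64(t1, t3);       /* d0 d1 d2 d3 */
}
```

For the tightly packed layout shown in the question the stride would be 4. For the 16-vector case the stride would be 16, and the helper could be called four times with src offset by 0, 4, 8 and 12 elements to cover all 16 columns.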

Note: this is probably more efficient than using AVX2 gathered loads, since they generate a read cycle per element, which makes them really only useful when the access pattern is unknown or difficult to work with.
