_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

Views: 99 Published: 2020/9/15 5:39:54 Tags: x86 simd intrinsics avx avx2

Question

In SSSE3, the PALIGNR instruction performs the following:

PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination.
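To make the description above concrete, here is a minimal scalar sketch of that behavior in portable C. The function name `palignr_model` is hypothetical and only models the semantics on byte arrays; it is not an intrinsic and not how the instruction is implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of PALIGNR / _mm_alignr_epi8(a, b, n):
 * concatenate a (high 16 bytes) with b (low 16 bytes), shift the composite
 * right by n bytes, and keep the low 16 bytes. Bytes shifted in from beyond
 * the top of a are zero. */
static void palignr_model(uint8_t out[16], const uint8_t a[16],
                          const uint8_t b[16], unsigned n)
{
    uint8_t composite[48] = {0};   /* layout: b | a | 16 zero bytes */

    assert(out && a && b);
    memcpy(composite, b, 16);
    memcpy(composite + 16, a, 16);
    if (n > 32)
        n = 32;                    /* shifting by 32 or more leaves only zeros */
    memcpy(out, composite + n, 16);
}
```

With b holding bytes 0..15 and a holding bytes 16..31, a shift of 3 yields bytes 3..18 of the composite.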

I'm currently porting my SSE4 code to AVX2 instructions, working on 256-bit registers instead of 128-bit ones. Naively, I believed that the intrinsic _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8, only on 256-bit registers. Sadly, that is not the case. In fact, _mm256_alignr_epi8 treats each 256-bit register as two 128-bit lanes and performs two independent "align" operations, one on the low lanes of the two operands and one on the high lanes. It effectively does the same thing as _mm_alignr_epi8, but on two pairs of 128-bit registers at once. It's most clearly illustrated here: _mm256_alignr_epi8
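The difference between the two behaviors can be sketched with scalar models in portable C. Both function names below are hypothetical; they model the semantics on byte arrays and assume a shift count of at most 32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of what VPALIGNR (_mm256_alignr_epi8(a, b, n)) does:
 * an independent 128-bit PALIGNR inside each of the two lanes. */
static void vpalignr_lanes_model(uint8_t out[32], const uint8_t a[32],
                                 const uint8_t b[32], unsigned n)
{
    assert(n <= 32);
    for (int lane = 0; lane < 2; lane++) {
        uint8_t composite[48] = {0};              /* b lane | a lane | zeros */
        memcpy(composite, b + 16 * lane, 16);
        memcpy(composite + 16, a + 16 * lane, 16);
        memcpy(out + 16 * lane, composite + n, 16);
    }
}

/* Hypothetical model of the full-width operation the question is after:
 * shift the 64-byte composite a:b right by n bytes, keep the low 32 bytes. */
static void alignr_256_model(uint8_t out[32], const uint8_t a[32],
                             const uint8_t b[32], unsigned n)
{
    uint8_t composite[64];

    assert(n <= 32);
    memcpy(composite, b, 32);
    memcpy(composite + 32, a, 32);
    memcpy(out, composite + n, 32);
}
```

With b holding bytes 0..31, a holding bytes 32..63, and n = 1, the full-width model returns bytes 1..32, while the per-lane model puts byte 32 (from the low lane of a) at index 15 and byte 48 at index 31.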

Currently my solution is to keep using _mm_alignr_epi8 by splitting the ymm (256-bit) registers into two xmm (128-bit) registers (high and low), like so:

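// Index 0 selects the low 128 bits of a ymm register, index 1 the high 128 bits.
__m128i xmm_ymm1_lo = _mm256_extractf128_si256(ymm1, 0);
__m128i xmm_ymm1_hi = _mm256_extractf128_si256(ymm1, 1);
__m128i xmm_ymm2_lo = _mm256_extractf128_si256(ymm2, 0);
// Shift the sequence ymm2_lo : ymm1_hi : ymm1_lo right by one byte, 128 bits at a time.
__m128i xmm_ymm_aligned_lo = _mm_alignr_epi8(xmm_ymm1_hi, xmm_ymm1_lo, 1);
__m128i xmm_ymm_aligned_hi = _mm_alignr_epi8(xmm_ymm2_lo, xmm_ymm1_hi, 1);
// _mm256_set_m128i takes its arguments as (high, low).
__m256i xmm_ymm_aligned = _mm256_set_m128i(xmm_ymm_aligned_hi, xmm_ymm_aligned_lo);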

This works, but there has to be a better way, right? Is there perhaps a more "general" AVX2 instruction that I should be using to get the same result?

Recommended answer

What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).

If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.

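// Returns bytes n..n+31 of the 64-byte sequence v1:v0 (v0 is the low half),
// for n in 0..32. The name mm256_alignr_full_epi8 is a placeholder: reusing
// _mm256_alignr_epi8 would clash with the real intrinsic in <immintrin.h>.
static inline __m256i mm256_alignr_full_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // The buffer must be aligned for the aligned stores below. C11 _Alignas is
    // shown; do whatever your compiler needs to make this buffer 64-byte aligned.
    // You want to avoid the possibility of a page-boundary crossing load.
    _Alignas(64) char buffer[64];

    // Two aligned stores to fill the buffer.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);

    // Misaligned load to get the data we want.
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}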
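For comparison, a register-only variant is sometimes used instead of the store/reload idiom above. It is not part of the original answer. The sketch below assumes GCC or Clang on x86 (for the `target` attribute and `__builtin_cpu_supports`), fixes the shift at one byte to match the question's example, and falls back to a portable path elsewhere. The names `alignr1_avx2` and `alignr1_256` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* Hypothetical helper: writes bytes 1..32 of the 64-byte sequence hi:lo. */
__attribute__((target("avx2")))
static void alignr1_avx2(uint8_t out[32], const uint8_t lo[32],
                         const uint8_t hi[32])
{
    __m256i vlo = _mm256_loadu_si256((const __m256i *)lo);
    __m256i vhi = _mm256_loadu_si256((const __m256i *)hi);
    /* mid = [ low 128 bits of hi : high 128 bits of lo ], the 256 bits
     * that straddle the boundary between the two inputs. */
    __m256i mid = _mm256_permute2x128_si256(vlo, vhi, 0x21);
    /* The per-lane VPALIGNR of mid over lo now gives the full-width shift. */
    __m256i res = _mm256_alignr_epi8(mid, vlo, 1);
    _mm256_storeu_si256((__m256i *)out, res);
}
#endif

/* Hypothetical wrapper: uses the AVX2 path when the CPU supports it,
 * otherwise a plain byte copy with the same result. */
static void alignr1_256(uint8_t out[32], const uint8_t lo[32],
                        const uint8_t hi[32])
{
    uint8_t composite[64];

    assert(out && lo && hi);
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2")) {
        alignr1_avx2(out, lo, hi);
        return;
    }
#endif
    memcpy(composite, lo, 32);
    memcpy(composite + 32, hi, 32);
    memcpy(out, composite + 1, 32);
}
```

The shift count of VPALIGNR is an immediate, so this form only works when the count is a compile-time constant. As written, the arrangement covers counts from 0 to 16. Larger counts would need the operands arranged differently.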

If you can provide more information about how exactly you're using palignr, I can probably be more helpful.
