在AVX寄存器内旋转字节的有效方法 [英] Efficient way of rotating a byte inside an AVX register

查看:115
本文介绍了在AVX寄存器内旋转字节的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

摘要/tl; :除了进行2x移位并将结果混合在一起以外,是否有其他方法可以按位旋转YMM寄存器中的字节(使用AVX)?

Summary/tl;dr: Is there any way to rotate a byte in an YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together?

对于YMM寄存器中的每8个字节,我需要向左旋转7个字节.每个字节需要比前一个向左旋转一位.因此,第一个字节应旋转0位,第七个字节应旋转6位.

For each 8 bytes in an YMM register, I need to left-rotate 7 bytes in it. Each byte needs to be rotated one bit more to the left than the former. Thus the 1 byte should be rotated 0 bits and the seventh should be rotated 6 bits.

当前,我已经实现了一种实现方法,该方法是通过[将1位旋转作为示例]将寄存器左移1位,然后右移7位.然后,我使用混合操作(固有操作_mm256_blend_epi16)从第一个和第二个临时结果中选择正确的位,以获取最终的旋转字节.
这样一来,每个字节总共要进行2个移位操作和1个混合操作,并且需要旋转6个字节,因此每个字节要进行18个操作(移位和混合具有几乎相同的性能).

Currently, I have made an implementation that does this by [I use the 1-bit rotate as an example here] shifting the register 1 bit to the left, and 7 to the right individually. I then use the blend operation (intrinsic operation _mm256_blend_epi16) to chose the correct bits from the first and second temporary result to get my final rotated byte.
This costs a total of 2 shift operations and 1 blend operation per byte, and 6 bytes needs to be rotated, thus 18 operations per byte (shift and blend has just about the same performance).

与使用18个操作旋转单个字节相比,必须有一种更快的方法!

There must be a faster way to do this than by using 18 operations to rotate a single byte!

此外,之后我需要在新寄存器中汇编所有字节.我通过使用"set"指令将7个掩码加载到寄存器中来做到这一点,因此我可以从每个寄存器中提取正确的字节.我将这些掩码与寄存器进行与"运算,以从寄存器中提取正确的字节.之后,我将单个字节寄存器进行XOR运算,以获取具有所有字节的新寄存器. 这总共需要执行7 + 7 + 6个操作,因此又需要执行20个操作(每个寄存器).

Furthermore, I need to assemble all the bytes afterwards in the new register. I do this by loading 7 masks with the "set" instruction into registers, so I can extract the correct byte from each register. I AND these mask with the registers to extract the correct byte from them. Afterwards I XOR the single byte registers together to get the new register with all the bytes. This takes a total of 7+7+6 operations, so another 20 operations (per register).

我可以使用提取的内在函数(_mm256_extract_epi8)来获取单个字节,然后使用_mm256_set_epi8来组装新寄存器,但是我还不知道这样做是否会更快. (《英特尔内在函数指南》中没有列出这些功能的性能,因此也许我对这里有所误解.)

I could use the extract intrinsic (_mm256_extract_epi8) to get the single bytes, and then use _mm256_set_epi8 to assemble the new registers, but I don't know yet whether that would be faster. (There is no listed performance for these functions in the Intel intrinsics guide, so maybe I am misunderstanding something here.)

这使每个寄存器总共进行38次操作,对于在寄存器中以不同方式轮换6个字节而言,这似乎不是最佳选择.

This gives a total of 38 operations per register, which seems less than optimal for rotating 6 bytes differently inside a register.

我希望更精通AVX/SIMD的人可以在这里为我提供指导-无论我是否以错误的方式进行操作-因为我觉得我现在可能正在这样做.

I hope someone more proficient in AVX/SIMD can guide me here—whether I am going about this the wrong way—as I feel I might be doing just that right now.

推荐答案

XOP指令集确实提供了 _mm_rot_epi8() (它不是特定于Microsoft的;它在4.4或更早版本的GCC中也可用,并且也应该在最近的clang中可用).它可以用来以128位为单位执行所需的任务.不幸的是,我没有支持XOP的CPU,所以我无法对其进行测试.

The XOP instruction set does provide _mm_rot_epi8() (which is NOT Microsoft-specific; it is also available in GCC since 4.4 or earlier, and should be available in recent clang, too). It can be used to perform the desired task in 128-bit units. Unfortunately, I don't have a CPU with XOP support, so I cannot test that.

在AVX2上,将256位寄存器分成两半,一个包含偶数字节,另一个奇数字节右移8位,可以实现16位向量乘法.给定常量(使用GCC 64位组件数组格式)

On AVX2, splitting the 256-bit register into two halves, one containing even bytes, and the other odd bytes shifted right 8 bits, allows a 16-bit vector multiply to do the trick. Given constants (using GCC 64-bit component array format)

static const __m256i epi16_highbyte = { 0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL };
static const __m256i epi16_lowbyte  = { 0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL };
static const __m256i epi16_oddmuls  = { 0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL };
static const __m256i epi16_evenmuls = { 0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL };

旋转操作可以写为

__m256i byteshift(__m256i value)
{
    return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_lowbyte), epi16_oddmuls), 8),
                           _mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_lowbyte), epi16_evenmuls), epi16_highbyte));
}

已验证这可以使用GCC-4.8.4在Intel Core i5-4200U上产生正确的结果.例如,输入向量(作为单个256位十六进制数)

This has been verified to yield correct results on Intel Core i5-4200U using GCC-4.8.4. As an example, the input vector (as a single 256-bit hexadecimal number)

88 87 86 85 84 83 82 81 38 37 36 35 34 33 32 31 28 27 26 25 24 23 22 21 FF FE FD FC FB FA F9 F8

被旋转到

44 E1 D0 58 24 0E 05 81 1C CD C6 53 A1 CC 64 31 14 C9 C4 52 21 8C 44 21 FF BF BF CF DF EB F3 F8

,其中最左边的八位位组向左旋转7位,然后再向后旋转6位,依此类推;对于所有32个八位位组,第七个八位位组保持不变,第八个八位位组旋转7位,依此类推.

where the leftmost octet is rotated left by 7 bits, next 6 bits, and so on; seventh octet is unchanged, eighth octet is rotated by 7 bits, and so on, for all 32 octets.

我不确定上述函数定义是否可以编译为最佳机器代码(取决于编译器),但是我对它的性能感到满意.

I am not sure if the above function definition compiles to optimal machine code -- that depends on the compiler --, but I'm certainly happy with its performance.

由于您可能不喜欢该函数的上述简洁格式,因此这里采用的是程序扩展形式:

Since you probably dislike the above concise format for the function, here it is in procedural, expanded form:

static __m256i byteshift(__m256i value)
{
    __m256i low, high;
    high = _mm256_srai_epi16(value, 8);
    low = _mm256_and_si256(value, epi16_lowbyte);
    high = _mm256_and_si256(high, epi16_lowbyte);
    low = _mm256_mullo_epi16(low, epi16_lowmuls);
    high = _mm256_mullo_epi16(high, epi16_highmuls);
    low = _mm256_srli_epi16(low, 8);
    high = _mm256_and_si256(high, epi16_highbyte);
    return _mm256_or_si256(low, high);
}


在评论中,彼得·科德斯建议用srli替换srai + and ,以及最终的and + orblendv.前者很有意义,因为它纯粹是一种优化,但是后者可能(实际上,在当前的Intel CPU上)还没有更快.


In a comment, Peter Cordes suggested replacing the srai+and with an srli, and possibly the final and+or with a blendv. The former makes a lot of sense, as it is purely an optimization, but the latter may not (yet, on current Intel CPUs!) actually be faster.

我尝试了一些微基准测试,但无法获得可靠的结果.我通常在x86-64上使用TSC,并使用存储在数组中的输入和输出进行数十万次测试的中位数.

I tried some microbenchmarking, but was unable to get reliable results. I typically use the TSC on x86-64, and take the median of a few hundred thousand tests using inputs and outputs stored to an array.

如果仅在此处列出变体,我认为这是最有用的,因此任何需要这种功能的用户都可以对其实际工作负载进行一些基准测试,并测试是否存在可测量的差异.

I think it is most useful if I will just list the variants here, so any user requiring such a function can make some benchmarks on their real-world workloads, and test to see if there is any measurable difference.

我也同意他的建议,使用oddeven代替highlow,但是请注意,由于向量中的第一个元素编号为元素0,因此第一个元素为偶数,第二个 odd ,依此类推.

I also agree with his suggestion to use odd and even instead of high and low, but note that since the first element in a vector is numbered element 0, the first element is even, the second odd, and so on.

#include <immintrin.h>

static const __m256i epi16_oddmask  = { 0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL,
                                        0xFF00FF00FF00FF00ULL };
static const __m256i epi16_evenmask = { 0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL,
                                        0x00FF00FF00FF00FFULL };
static const __m256i epi16_evenmuls = { 0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL,
                                        0x4040101004040101ULL };
static const __m256i epi16_oddmuls  = { 0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL,
                                        0x8080202008080202ULL };

/* Original version suggested by Nominal Animal. */
__m256i original(__m256i value)
{
    return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
                           _mm256_and_si256(_mm256_mullo_epi16(_mm256_and_si256(_mm256_srai_epi16(value, 8), epi16_evenmask), epi16_oddmuls), epi16_oddmask));
}

/* Optimized as suggested by Peter Cordes, without blendv */
__m256i no_blendv(__m256i value)
{
    return _mm256_or_si256(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
                           _mm256_and_si256(_mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask));
}

/* Optimized as suggested by Peter Cordes, with blendv.
 * This is the recommended version. */
__m256i optimized(__m256i value)
{
    return _mm256_blendv_epi8(_mm256_srli_epi16(_mm256_mullo_epi16(_mm256_and_si256(value, epi16_evenmask), epi16_evenmuls), 8),
                              _mm256_mullo_epi16(_mm256_srli_epi16(value, 8), epi16_oddmuls), epi16_oddmask);
}

此处以显示单个操作的方式编写的相同功能.尽管它根本不影响正常的编译器,但是我已经标记了函数参数和每个临时值const,因此很明显如何将它们插入到后续表达式中,以将函数简化为上述简洁形式.

Here are the same functions written in a way that shows the individual operations. Although it does not affect sane compilers at all, I've marked the function parameter and each temporary value const, so that it is obvious how you can insert each into a subsequent expression, to simplify the functions to their above concise forms.

__m256i original_verbose(const __m256i value)
{
    const __m256i odd1  = _mm256_srai_epi16(value, 8);
    const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
    const __m256i odd2  = _mm256_and_si256(odd1, epi16_evenmask);
    const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
    const __m256i odd3  = _mm256_mullo_epi16(odd3, epi16_oddmuls);
    const __m256i even3 = _mm256_srli_epi16(even3, 8);
    const __m256i odd4  = _mm256_and_si256(odd3, epi16_oddmask);
    return _mm256_or_si256(even3, odd4);
}

__m256i no_blendv_verbose(const __m256i value)
{
    const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
    const __m256i odd1  = _mm256_srli_epi16(value, 8);
    const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
    const __m256i odd2  = _mm256_mullo_epi16(odd1, epi16_oddmuls);
    const __m256i even3 = _mm256_srli_epi16(even2, 8);
    const __m256i odd3  = _mm256_and_si256(odd2, epi16_oddmask);
    return _mm256_or_si256(even3, odd3);
}

__m256i optimized_verbose(const __m256i value)
{
    const __m256i even1 = _mm256_and_si256(value, epi16_evenmask);
    const __m256i odd1  = _mm256_srli_epi16(value, 8);
    const __m256i even2 = _mm256_mullo_epi16(even1, epi16_evenmuls);
    const __m256i odd2  = _mm256_mullo_epi16(odd1, epi16_oddmuls);
    const __m256i even3 = _mm256_srli_epi16(even2, 8);
    return _mm256_blendv_epi8(even3, odd2, epi16_oddmask);
}

我个人最初确实以上述冗长的形式编写测试函数,因为形成简洁的版本是一组琐碎的复制粘贴操作.但是,我确实测试了两个版本以验证是否引入任何错误,并保持详细版本可访问(作为注释),因为简洁版本基本上是只写的.与尝试编辑简洁版本相比,编辑详细版本然后将其简化为简洁形式要容易得多.

I personally do write my test functions initially in their above verbose forms, as forming the concise version is a trivial set of copy-pasting. I do, however, testing both versions to verify against introducing any errors, and keeping the verbose version accessible (as a comment or so), because the concise versions are basically write-only. It is much easier to edit the verbose version, then simplify it to the concise form, than trying to edit the concise version.

这篇关于在AVX寄存器内旋转字节的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆