Implementation of bit rotate operators using SIMD in CUDA

Question

I know that StackOverflow is not meant for asking other people for code, but hear me out.

I am trying to implement some AES functions in CUDA C++ device code. While trying to implement the left bytewise rotate operator, I was disconcerted to see that there is no native SIMD intrinsic for it. So I began a naive implementation, but... it's huge, and while I haven't tried it yet, it just won't be fast because of the expensive unpacking/packing. So, is there a way to do a per-byte bit rotate operation that's at least somewhat efficient?

Here's the code if you want to have a look.

__inline__ __device__ uint32_t per_byte_bit_left_rotate(uint32_t input, uint8_t amount) {
    // Naive version: unpack each byte, rotate it, mask off the bits that
    // spilled past bit 7, and repack. Assumes 0 <= amount <= 7.
    const uint32_t r = (8 - amount) & 7; // right-shift count for the wrapped-around bits
    const uint32_t b0 = input & 0xFF, b1 = (input >> 8) & 0xFF,
                   b2 = (input >> 16) & 0xFF, b3 = input >> 24;
    return ((((b0 << amount) | (b0 >> r)) & 0xFF) << 0) |
           ((((b1 << amount) | (b1 >> r)) & 0xFF) << 8) |
           ((((b2 << amount) | (b2 >> r)) & 0xFF) << 16) |
           ((((b3 << amount) | (b3 >> r)) & 0xFF) << 24); }

Recommended answer

The rotate count is the same for all elements, right?

Shift the whole input left and right, then AND each result with a mask that zeroes all the bits that crossed a byte boundary, handling all 4 bytes in one AND. I think amount is always a compile-time constant in AES, so you don't have to worry about the runtime cost of generating the masks on the fly; just let the compiler do it. (I don't know CUDA, but this appears to be the same problem as writing a SWAR bit-hack with 32-bit integers in normal C++.)

This is based on the usual (x << count) | (x >> (32-count)) rotate idiom, with masking and a different right-shift count to turn it into four separate 8-bit rotates.

inline
uint32_t per_byte_bit_left_rotate(uint32_t input, unsigned amount)
{
    // With constant amount, the left/right masks are constants
    uint32_t rmask = 0xFF >> ((8 - amount) & 7);
    rmask = (rmask<<24 | rmask<<16 | rmask<<8 | rmask);
    uint32_t lmask = ~rmask;

    uint32_t lshift = input << amount;
    lshift &= lmask;
    if (amount == 1) {  // special case left-shift by 1 using an in-lane add instead of shift&mask
        lshift = __vadd4(input, input);
    }
    uint32_t rshift = input >> ((8 - amount) & 7);
    rshift &= rmask;

    uint32_t rotated = lshift | rshift;
    return rotated;
}

It might be even more efficient to mask the input one way before shifting, and mask the output after shifting: (in & lmask) << amount | ((in >> (8 - amount)) & rmask), with a different lmask. NVidia hardware is in-order superscalar, and shifts have limited throughput. Doing it that way would be more likely to execute as two independent shift+mask pairs.
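A minimal sketch of that mask-before-shift variant, written as plain C++ so it can be checked on the host. The function name and the multiply-by-0x01010101 trick for replicating a byte mask are my own choices, not from the answer above, and I have not measured whether it is actually faster on any GPU. The count is reduced with & 7, so out-of-range counts don't hit shift UB.

```cpp
#include <cstdint>

// Hypothetical name; illustrates (in & lmask) << amount | ((in >> rcount) & rmask).
inline uint32_t per_byte_rotl_premask(uint32_t in, unsigned amount)
{
    amount &= 7;                                // keep the count in 0..7
    const unsigned rcount = (8 - amount) & 7;   // complementary right-shift count

    // Pre-shift mask: the low (8 - amount) bits of each byte, i.e. the bits
    // that stay inside their own byte after the left shift.
    const uint32_t lmask = (0xFFu >> amount) * 0x01010101u;

    // Post-shift mask: the low `amount` bits of each byte, where the
    // wrapped-around bits land after the right shift.
    const uint32_t rmask = (0xFFu >> rcount) * 0x01010101u;

    const uint32_t hi = (in & lmask) << amount; // mask, then shift
    const uint32_t lo = (in >> rcount) & rmask; // shift, then mask
    return hi | lo;
}
```

With amount == 0 both masks are all-ones and both halves equal the input, so the OR still returns the input unchanged.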

(This doesn't try to avoid C++ UB with amount >= 32. See "Best practices for circular shift (rotate) operations in C++". In this case, I think changing to lshift = input << (amount & 7) would do the trick.)

To test that this compiles efficiently, I looked at the clang -O3 asm output for x86-64 with a constant amount. The Godbolt compiler explorer has compilers for various architectures (not CUDA though), so you can flip to ARM, MIPS or PowerPC there if you can read those asm languages more easily than x86.

uint32_t rol7(uint32_t a) {
    return per_byte_bit_left_rotate(a, 7);
}
    mov     eax, edi
    shl     eax, 7
    shr     edi
    and     eax, -2139062144   # 0x80808080
    and     edi, 2139062143    # 0x7F7F7F7F
    lea     eax, [rdi + rax]   # ADD = OR when no bits intersect
    ret

Perfect, exactly what I hoped for.

A couple of test cases:

uint32_t test_rol() {
    return per_byte_bit_left_rotate(0x02ffff04, 0);
}
    // yup, returns the input with count=0
    // return 0x2FFFF04


uint32_t test2_rol() {
    return per_byte_bit_left_rotate(0x02f73804, 4);
}
    // yup, swaps nibbles
    // return 0x207F8340
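
The spot checks above could be extended into a comparison against a byte-at-a-time reference for every count from 0 to 7. This is only a sketch: swar_rotl8x4, reference_rotl8x4 and count_mismatches are hypothetical names, and the SWAR function here is a host-only copy of the approach that leaves out the CUDA-specific __vadd4 special case.

```cpp
#include <cstdint>

// Host-only copy of the SWAR rotate (no __vadd4 special case), count taken mod 8.
inline uint32_t swar_rotl8x4(uint32_t input, unsigned amount)
{
    amount &= 7;
    const unsigned rcount = (8 - amount) & 7;
    const uint32_t rmask = (0xFFu >> rcount) * 0x01010101u; // replicate byte mask
    const uint32_t lmask = ~rmask;
    return ((input << amount) & lmask) | ((input >> rcount) & rmask);
}

// Reference: unpack each byte, rotate it on its own, and repack.
inline uint32_t reference_rotl8x4(uint32_t input, unsigned amount)
{
    amount &= 7;
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const uint32_t b = (input >> (8 * lane)) & 0xFFu;
        const uint32_t r = ((b << amount) | (b >> ((8 - amount) & 7))) & 0xFFu;
        result |= r << (8 * lane);
    }
    return result;
}

// Counts disagreements over a few sample inputs and all counts 0..7.
inline int count_mismatches()
{
    const uint32_t samples[] = {0x00000000u, 0xFFFFFFFFu, 0x02FFFF04u,
                                0x02F73804u, 0x80000001u, 0x01020408u,
                                0xDEADBEEFu};
    int mismatches = 0;
    for (uint32_t s : samples)
        for (unsigned amount = 0; amount < 8; ++amount)
            if (swar_rotl8x4(s, amount) != reference_rotl8x4(s, amount))
                ++mismatches;
    return mismatches;
}
```

Calling count_mismatches() from a host-side test and checking for zero gives more coverage than the two hand-picked inputs, at the cost of not exercising the actual device build.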

This is the same kind of thing you need to do for 8-bit shifts with x86 SSE2 / AVX2, because the smallest shift granularity the hardware supports is 16-bit.
