CUDA:对未签名字符的原子操作 [英] CUDA: Atomic operations on unsigned chars

查看:66
本文介绍了CUDA:对未签名字符的原子操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是CUDA初学者.我在全局内存中有一个无符号字符的像素缓冲区,该缓冲区可以被任何线程和所有线程更新.因此,为了避免像素值怪异,我想在线程尝试更新一个时执行atomicExch.但是编程指南说此功能仅适用于32位或64位字,而我只想原子交换一个8位字节.有办法吗?

I'm a CUDA beginner. I have a pixel buffer of unsigned chars in global memory that can and is updated by any and all threads. To avoid weirdness in the pixel values, therefore, I want to perform an atomicExch when a thread attempts to update one. But the programming guide says that this function only works on 32- or 64-bit words, whereas I just want to atomically exchange one 8-bit byte. Is there a way to do this?

谢谢.

推荐答案

我最近遇到了这个问题.从理论上讲,原子操作/乐观重试应该比锁/互斥体快,因此对其他数据类型使用原子操作的黑客"解决方案在我看来比使用关键部分要好.

I just ran into this problem recently. In theory, atomic operations / optimistic retries are supposed to be faster than locks/mutexes, so the "hack" solutions that use atomic operations on other data types seem better to me than using critical sections.

这是基于我已经测试了所有这些功能,并且测试似乎表明它们到目前为止可以正常工作.

I've tested all of these, and my tests seem to show that they work fine so far.

atomicAdd的版本1(用于char)

__device__ static inline char atomicAdd(char* address, char val) {
    // offset, in bytes, of the char* address within the 32-bit address of the space that overlaps it
    size_t long_address_modulo = (size_t) address & 3;
    // the 32-bit address that overlaps the same memory
    auto* base_address = (unsigned int*) ((char*) address - long_address_modulo);
    // A 0x3210 selector in __byte_perm will simply select all four bytes in the first argument in the same order.
    // The "4" signifies the position where the first byte of the second argument will end up in the output.
    unsigned int selectors[] = {0x3214, 0x3240, 0x3410, 0x4210};
    // for selecting bytes within a 32-bit chunk that correspond to the char* address (relative to base_address)
    unsigned int selector = selectors[long_address_modulo];
    unsigned int long_old, long_assumed, long_val, replacement;

    long_old = *base_address;

    do {
        long_assumed = long_old;
        // replace bits in long_old that pertain to the char address with those from val
        long_val = __byte_perm(long_old, 0, long_address_modulo) + val;
        replacement = __byte_perm(long_old, long_val, selector);
        long_old = atomicCAS(base_address, long_assumed, replacement);
    } while (long_old != long_assumed);
    return __byte_perm(long_old, 0, long_address_modulo);
}

atomicCAS for char

__device__ static inline char atomicCAS(char* address, char expected, char desired) {
    size_t long_address_modulo = (size_t) address & 3;
    auto* base_address = (unsigned int*) ((char*) address - long_address_modulo);
    unsigned int selectors[] = {0x3214, 0x3240, 0x3410, 0x4210};

    unsigned int sel = selectors[long_address_modulo];
    unsigned int long_old, long_assumed, long_val, replacement;
    char old;

    long_val = (unsigned int) desired;
    long_old = *base_address;
    do {
        long_assumed = long_old;
        replacement = __byte_perm(long_old, long_val, sel);
        long_old = atomicCAS(base_address, long_assumed, replacement);
        old = (char) ((long_old >> (long_address_modulo * 8)) & 0x000000ff);
    } while (expected == old && long_assumed != long_old);

    return old;
}

atomicAdd的版本2(用于char)(使用位移而不是__byte_perm,因此必须处理溢出)

__device__ static inline char atomicAdd2(char* address, char val) {
    size_t long_address_modulo = (size_t) address & 3;
    auto* base_address = (unsigned int*) ((char*) address - long_address_modulo);
    unsigned int long_val = (unsigned int) val << (8 * long_address_modulo);
    unsigned int long_old = atomicAdd(base_address, long_val);

    if (long_address_modulo == 3) {
        // the first 8 bits of long_val represent the char value,
        // hence the first 8 bits of long_old represent its previous value.
        return (char) (long_old >> 24);
    } else {
        // bits that represent the char value within long_val
        unsigned int mask = 0x000000ff << (8 * long_address_modulo);
        unsigned int masked_old = long_old & mask;
        // isolate the bits that represent the char value within long_old, add the long_val to that,
        // then re-isolate by excluding bits that represent the char value
        unsigned int overflow = (masked_old + long_val) & ~mask;
        if (overflow) {
            atomicSub(base_address, overflow);
        }
        return (char) (masked_old >> 8 * long_address_modulo);
    }
}

对于atomicMin,请检查

For atomicMin, please check this thread.

这篇关于CUDA:对未签名字符的原子操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆