CUDA block synchronization differences between GTS 250 and Fermi devices


Problem Description

So I've been working on a program in which I'm creating a hash table in global memory. The code is completely functional (albeit slower) on a GTS250, which is a Compute 1.1 device. However, on a Compute 2.0 device (C2050 or C2070) the hash table is corrupt (the data is incorrect and pointers are sometimes wrong).

Basically, the code works fine when only one block is used (on both devices). However, when 2 or more blocks are used, it works only on the GTS250 and not on any Fermi device.

I understand that the warp scheduling and memory architecture of the two platforms are different, and I took that into account when developing the code. From my understanding, using __threadfence() should ensure that any global writes are committed and made visible to other blocks; however, judging from the corrupt hash table, it appears that they are not.
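For reference, the guarantee being relied on here is the usual publish pattern, shown below as a minimal sketch with hypothetical names (payload, readyFlag, observed); it is not the poster's code and assumes both blocks are resident on the device at the same time. One block writes data, calls __threadfence() so the write is visible device-wide, and only then sets a flag that another block polls. The variables are declared volatile so the polling block actually re-reads them from memory.

__device__ volatile int payload;   // hypothetical data published by block 0
__device__ volatile int readyFlag; // hypothetical flag polled by block 1 (zero-initialized)
__device__ int observed;           // where block 1 stores what it saw

__global__ void publish_sketch()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 42;        // write the data first
        __threadfence();     // make the write visible to the whole device
        readyFlag = 1;       // only then publish the flag
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (readyFlag == 0);  // spin until block 0 publishes
        observed = payload;      // expected to read 42 once the flag is seen
    }
}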

I've also posted the problem on the NVIDIA CUDA developer forum, and it can be found here.

The relevant code follows:

// Spin until the mutex word flips from 0 (unlocked) to 1 (locked).
__device__ void lock(int *mutex) {
    while(atomicCAS(mutex, 0, 1) != 0);
}

// Release the mutex by atomically writing 0 back.
__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);

    lock(&globalHashLocks[hashValue]);

    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;

    bool found = false;
    Entry currentEntry;

    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }

    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);

        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;

        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }

    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}


Answer

As pointed out by LSChien (http://forums.nvidia.com/index.php?showuser=71314) in this post, the issue is L1 cache coherency. While using __threadfence() will guarantee that shared and global memory writes are visible to other threads, since it is not atomic, thread x in block 1 may still read a stale cached value until thread y in block 0 has executed the threadfence instruction. Instead, LSChien suggested a hack in his post: use atomicCAS() to force the thread to read from global memory rather than from a cached value. The proper way to do this is to declare the memory as volatile, which requires that every write to that memory be made visible to all other threads in the grid immediately.
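To make the two options concrete, here is a minimal sketch (hypothetical helper names, not code from the answer). The first helper uses the atomicCAS() trick: a compare-and-swap whose compare and swap values are both 0 never changes the stored word but returns its current value, and atomics operate on L2/global memory, bypassing the non-coherent L1 cache on Fermi. The second shows the volatile approach, where every access through the qualified pointer is forced out to memory.

// Hack suggested by LSChien: read a word through the atomic unit so the value
// comes from L2/global memory, not the per-SM L1 cache. The swap only occurs
// when the word is already 0, and it is then swapped with 0, so the stored
// value is never modified.
__device__ int load_bypassing_l1(int *addr)
{
    return atomicCAS(addr, 0, 0);
}

// The "proper" fix: qualify the shared data as volatile so the compiler issues
// a real memory transaction for every access instead of reusing a cached or
// register-held value. (The helper name and parameters are illustrative only.)
__device__ int find_bucket_head(volatile int *entries, unsigned int hashValue)
{
    return entries[hashValue];   // always re-read from memory
}

In the question's code this would mean accessing table.entries and table.pool through volatile-qualified pointers; note that the Entry fields would then have to be copied element by element, since whole-struct assignment through a volatile pointer is not supported by the default assignment operator.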

