How can I enforce CUDA global memory coherence without declaring pointer as volatile?


Problem description

I'll first do some contextualization. I'm trying to implement a non-blocking work-stealing method using deques in CUDA. The deques (aDeques) live in a block-segmented array in global memory, and the popWork() device function pops work to feed the threads. In addition to the global deques, each block has a stack in shared memory (aLocalStack) where it can work locally. The pop happens at 3 levels: the first attempt is on the shared stack, the second attempt is on the deque owned by the block, and the third attempt is to steal work from other deques. Each deque has global bottom and top (pop) pointers that lie in global memory arrays (aiDequesBottoms and auiDequesAges). My problem is that when a block changes a global deque pointer, the change isn't visible to other blocks when I test the code on a GTS450. It seems like the cache is not being updated. I have also tested on a GT520 card, where the problem does not occur. I experienced similar problems with the aiDequeFlags array; those were solved by declaring it volatile. Unfortunately, I can't do the same for the deque pointer arrays, since I need to use atomic functions on them later. I'm sorry for not putting the problem into a simpler example, but I couldn't reproduce this behavior in one. This first snippet shows the popWork() interface.

template <int iDequeSize> //Size of each segment in aDeques
bool __inline__ __device__ popWork(
    volatile int *aiDequeFlags ,        //Flags that indicate whether a deque is active (has work)
    int *aiDequesBottoms ,              //Deque bottom pointers
    unsigned int *auiDequesAges ,       //Deque top pointers (29 higher bits) +
                                        //tag bits (3 lower bits)
    const Int2Array *aDeques ,          //Deques (Int2Array is an interface for 2 int arrays)
    int &uiStackBot ,                   //Shared memory stack pointer
    int2 *aLocalStack ,                 //Shared memory local stack
    const int &iTid ,                   //threadIdx.x
    const int &iBid ,                   //blockIdx.x

    //All remaining parameters are output

    unsigned int &uiPopDequeIdx ,       //Chosen deque for the pop
    int2 *popStartIdxAndSize ,          //Array of pop start indices and sizes
    bool *bPopFlag ,                    //Array of flags for the pop at each level
    unsigned int &uiActiveDequesIdx ,   //Flag that indicates the pop failed (no more work)
    int2 &work);                        //Actual acquired thread work

This second snippet has the entire function. The kernel that uses the function was launched with 8 blocks of 64 threads; at the beginning only deque 0 has 1 work item, while all other deques are empty. There are some debug printf calls to generate a log, which is shown in a later snippet.

template <int iDequeSize>
bool __inline__ __device__ popWork(volatile int *aiDequeFlags , int *aiDequesBottoms , unsigned int *auiDequesAges ,
const Int2Array *aDeques , int &uiStackBot , int2 *aLocalStack , const int &iTid , const int &iBid ,
unsigned int &uiPopDequeIdx , int2 *popStartIdxAndSize , bool *bPopFlag , unsigned int &uiActiveDequesIdx , int2 &work)
{
//Pop from local stack
if(iTid == 0)
{
    unsigned int uiAge = 0;
    bPopFlag[0] = popBottom(uiStackBot , uiAge , popStartIdxAndSize[iBid]); 
    bPopFlag[3] = bPopFlag[0];
}

__syncthreads();

if(bPopFlag[0])
{
    if(iTid < popStartIdxAndSize[iBid].y)
    {
        work = aLocalStack[popStartIdxAndSize[iBid].x + iTid];
    }
}
else
{
    if(iTid == 0)
    {   //Try to pop from block deque

        bPopFlag[1] = popBottom(aiDequesBottoms[iBid] , auiDequesAges[iBid] , popStartIdxAndSize[iBid]);

        if(bPopFlag[1])
        {
            uiPopDequeIdx = iBid;
            //Debug
            if(iBid == 0)
            {
                printf("Block %d pop global deque. Bottom=%d\n" , iBid , aiDequesBottoms[iBid]);
            }
            //
        }
        else
        {
            aiDequeFlags[iBid] = 0;
            popStartIdxAndSize[iBid].x = INFTY;
            uiPopDequeIdx = INFTY;
        }
        bPopFlag[3] = bPopFlag[1];
        bPopFlag[2] = false;
    }
    __syncthreads();

    if(!bPopFlag[1])
    {
        //Verify if lazy steal can be done.
        if(iTid < NDEQUES)
        {
            if(popStartIdxAndSize[iTid].x != INFTY && iTid != iBid)
            {
                atomicMin(&uiPopDequeIdx , iTid);
                bPopFlag[2] = true;
                bPopFlag[3] = true;
            }
        }

        __syncthreads();

        if(iTid == uiPopDequeIdx)
        {
            popStartIdxAndSize[iBid] = popStartIdxAndSize[iTid];
            popStartIdxAndSize[iTid].x = INFTY;
        }

        while(!bPopFlag[3])
        {   //No more work, try to steal some!
            __syncthreads();

            if(iTid == 0)
            {
                uiActiveDequesIdx = 0;
            }
            __syncthreads();

            if(iTid < NDEQUES)
            {
                if(aiDequeFlags[iTid] == 1)
                {
                    uiActiveDequesIdx = 1;

                    //Debug
                    printf("Block %d steal attempt on block %d. Victim bottom=%d\n" , blockIdx.x , threadIdx.x , aiDequesBottoms[iTid]);
                    //

                    if(popTop(aiDequesBottoms , auiDequesAges , iTid , popStartIdxAndSize[iTid]))
                    {
                        aiDequeFlags[iBid] = 1;
                        atomicMin(&uiPopDequeIdx , iTid);
                        bPopFlag[3] = true;

                        //Debug
                        //printf("%d ss %d %d %d\n" , iBid , iTid , popStartIdxAndSize[iTid].x , popStartIdxAndSize[iTid].y);
                        //
                    }
                }
            }

            __syncthreads();

            if(uiActiveDequesIdx == 0)
            { //No more work to steal. End.
                break;
            }

            if(iTid == uiPopDequeIdx)
            {
                popStartIdxAndSize[iBid] = popStartIdxAndSize[iTid];
                popStartIdxAndSize[iTid].x = INFTY;
            }

            __syncthreads();
        }
    }

    __syncthreads();

    if(bPopFlag[3] && iTid < popStartIdxAndSize[iBid].y) //assuming number of threads >= WORK_SIZE
    {
        aDeques->getElement(work , uiPopDequeIdx*iDequeSize + popStartIdxAndSize[iBid].x + iTid);
    }
}

return bPopFlag[3];

}

This last snippet is the generated log. The push lines ("Block X push. Bottom=Y") were generated by a push function which is not shown here. Remember that at the beginning only block 0 has 1 work item.

Block 0 pop global deque. Bottom=0
Block 4 steal attempt on block 0. Victim bottom=0
Block 2 steal attempt on block 0. Victim bottom=1
Block 6 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 0. Victim bottom=1
Block 3 steal attempt on block 0. Victim bottom=1
Block 7 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 0. Victim bottom=1
Block 6 steal attempt on block 0. Victim bottom=1
Block 3 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 0. Victim bottom=1
Block 4 steal attempt on block 0. Victim bottom=0
Block 7 steal attempt on block 0. Victim bottom=1
Block 0 push. Bottom=448
Block 2 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 0. Victim bottom=1
Block 6 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 0. Victim bottom=1
Block 3 steal attempt on block 0. Victim bottom=1
Block 7 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 0. Victim bottom=1
Block 2 steal attempt on block 4. Victim bottom=0
Block 1 steal attempt on block 0. Victim bottom=1
Block 1 steal attempt on block 4. Victim bottom=0
Block 5 steal attempt on block 0. Victim bottom=1
Block 5 steal attempt on block 4. Victim bottom=0
Block 4 push. Bottom=384

As can be seen, only block 4 can see the changes to the block 0 deque bottom pointer. I tried adding some __threadfence() calls after every change to the pointers, but with no success. Thanks for the attention!

Recommended answer

It seems, based on the comments, that the only working solution is to turn off L1 caching. This can be accomplished on a program-wide basis by passing the following switch to nvcc when compiling:

-Xptxas -dlcm=cg
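For example, a hypothetical full compile line might look like this (the source file name and the -arch value are assumptions; sm_21 matches a GTS450-class Fermi device):

nvcc -arch=sm_21 -Xptxas -dlcm=cg -o popwork popwork.cu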

The L1 caches are a property/resource of the SM, not of the device as a whole. Since threadblocks execute on specific SMs, the activity of one threadblock in its L1 cache can be incoherent with the activity of another threadblock and its L1 cache (assuming it happens to be running on a different SM), even though both reference the same locations in global memory. L1 caches in different SMs have no connection with each other and are not guaranteed to be coherent with each other.

Note that the L2 cache is device-wide and therefore "coherent" from the perspective of individual threadblocks. Turning off L1 caching has no effect on L2 caching, so there is still the possibility of some caching benefit; however, the time required to satisfy a request out of L2 is longer than the time required to satisfy a request out of L1, so turning off L1 caching program-wide is a pretty large hammer to get things working.

The volatile keyword in front of a variable definition should have the effect of telling the compiler to skip L1 caching on loads (according to my understanding). But volatile by itself doesn't address the write path, so it's possible for a threadblock on one SM to do a volatile read, pull a value out of L2, modify that value, and then write it back, where it ends up in L1 (until it is evicted). If another threadblock then reads the same global value, it may not see the effect of the update.
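As a minimal sketch of the load side only (the helper name is invented for illustration, not taken from the question), reading through a volatile-qualified view of the pointer keeps the compiler from reusing a previously loaded value:

//Hypothetical helper: perform the load through a volatile-qualified pointer so the
//compiler issues a fresh read instead of reusing a cached/registered value.
__device__ __forceinline__ int loadFresh(const int *p)
{
    return *((volatile const int *)p);
}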

Diligent use of __threadfence(), while tedious, should force any such updates out of L1 into L2, so that other threadblocks can read them. However, this still leaves a synchronization gap between when the value is written and when it is observable by other SMs/threadblocks.
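A minimal sketch of that write-side pattern, assuming a made-up globally shared value (not the poster's code):

__device__ int g_iSharedBottom; //hypothetical value shared between blocks

__device__ void publishBottom(int iNewBottom)
{
    g_iSharedBottom = iNewBottom; //the store may initially sit in this SM's L1
    __threadfence();              //force the write to be visible device-wide before proceeding
}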

(Global) atomics should also have the effect of going directly to "global memory" to read and write the values used.
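One commonly used way to exploit that (a sketch under that assumption, not code from the original post) is to route both the read and the write through atomic functions, e.g. reading with atomicAdd(p, 0) and writing with atomicExch():

//Sketch: both accesses take the atomic path instead of ordinary cached loads/stores.
__device__ __forceinline__ int readGlobal(int *p)
{
    return atomicAdd(p , 0);  //adding 0 atomically behaves as a read that bypasses L1
}

__device__ __forceinline__ void writeGlobal(int *p , int iVal)
{
    atomicExch(p , iVal);     //atomic exchange behaves as a write that bypasses L1
}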

It may also be instructive to go through the code and ensure that every possible read from a globally synchronized location is handled properly (e.g. with volatile or with atomics), that every possible write to a globally synchronized location is handled properly (e.g. with __threadfence() or atomics), and to check for race conditions between different blocks.

As discovered, the process of creating a stable globally-synchronized environment within the GPU is non-trivial. Those other questions (e.g. with respect to Kepler, and e.g. discussing global semaphores) may also be of interest.

EDIT: To respond to a question posted in the comments, I would say this:

Perhaps there's no issue. However, __threadfence() provides no guarantee (that I know of) of a maximum completion time. Therefore, at the moment an update is made to a global location, only the L1 associated with the executing threadblock/SM gets updated. Then we hit the __threadfence(). Presumably the threadfence takes some time to complete, and during that time another threadblock could be resident on the same SM, brought in for execution (while the previous thread/warp/block is stalled at the threadfence), and "see" the updated global value in the (local) L1 associated with that SM. Other threadblocks executing on other SMs will see the "stale" value until the __threadfence() completes. This is what I am referring to as a possible "synchronization gap": two different blocks can still see two different values for a brief period of time. Whether this matters will depend on how the global value is being used for synchronization between blocks (since that is the topic under discussion). Therefore atomics + volatile may be a better choice than volatile + threadfence, to try to cover both the read and write paths for synchronization.
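A sketch of what that combination could look like for an array such as aiDequesBottoms (illustrative only; the cast and the atomicMin call are not taken from the actual fix): declare the pointer volatile so plain reads are always re-issued, and cast the volatile qualifier away where an atomic update is needed.

//Sketch: the array is accessed through a volatile pointer for plain reads,
//while updates still go through atomics by casting the volatile qualifier away.
__device__ void exampleUpdate(volatile int *aiDequesBottoms , int iBid , int iNewBottom)
{
    int iCurrent = aiDequesBottoms[iBid]; //volatile read: always re-loaded
    if(iCurrent > iNewBottom)
    {
        atomicMin((int *)&aiDequesBottoms[iBid] , iNewBottom); //atomic write path
    }
}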

Edit #2: It seems from the comments that the combination of atomics plus volatile also solved the problem.
