仅使用随机播放+选票+ popc从经纱中的选定车道写入数据 [英] Write data only from selected lanes in a Warp using Shuffle + ballot + popc

查看:77
本文介绍了仅使用随机播放+选票+ popc从经纱中的选定车道写入数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

尝试保存一些共享内存以改善处理时遇到一些问题.在此代码中,每个线程都从主机中选择一个模板索引(t_initdwl),并将其扩展到下一个内核对其进行处理之前的上限.

这个想法是使用shuffle +选票来保留"适当的空间量,以便仅从不超过上限的线程中写入索引.由于它是一组限制,因此首先要测试每个列限制,然后将其写入. 扩展条目"的数量根据初始索引中的值而有所不同,但是,一旦线程超过限制,任何进一步的增量都是无效的,因此将play = false(为安全起见)设置并返回.

问题是bCol值始终为trasnp_line + 1,从而使popc仅在此行中无法正常工作,因为1仅是lineID 0的正确值.我没有任何错误,

考虑到仍在播放"(尚未返回)的行数,预期结果是要写入的行的正确位置.

__global__ void dwLgen(const int maxthreads, short* __restrict__ kpL, int* nkpl, 
        const short* __restrict__ init_dwL, const short rloops){
    __shared__ short t_initdwl[1024][24];

    const int warpID(threadIdx.x / warpSize);
    const int laneID(threadIdx.x % warpSize);
    const int st(rloops + 2);
    const int stb((p - kpMax + 1 + 1) * BUFFERSIZE_MAX); 
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int cAlive, tAlive, bCol, bline, transp_line;  
    int i, j; 
    bool volatile play(true);

    if (idx < maxthreads) {
        for (j = 0; j < st; j++){
            t_initdwl[threadIdx.x][j] = init_dwL[idx + j * BUFFERSIZE_MAX];
        }
        __syncthreads();
        for (i = 0; i < p; i++) {
            for (j = 0; j < st; j++)
                if ((t_initdwl[threadIdx.x][j] + i) > dwLt[j]) {
                    play = false;
                    return;
                }

            cAlive = __ballot((play == true));
            tAlive = __popc(cAlive);
            bline = __ffs(cAlive) - 1;
            if (laneID == bline)
                transp_line = atomicAdd(nkpl, tAlive);
            transp_line = __shfl(transp_line, bline);

            tAlive = exp2f(laneID + 1); 
            bline = tAlive & cAlive; // the bline value is ok
            bCol = transp_line + __popc(bline); // but __popc(bline) become always 1


            for (j = 0; j < st; j++)
                kpL[bCol + j * stb] = t_initdwl[threadIdx.x][j] + i;
        }
    }
}

提前谢谢!

解决方案

感谢所有支持!

在阅读@njuffa并测试代码变体后,它会起作用.

先前对LaneID进行位掩码的方法不正确.
使用exp2f仅会产生第一个和最后一个泳道的预期结果.

tAlive = exp2f(laneID + 1)

替换为:

tAlive = 0xffffffff >> (warpSize - laneID);

所以,现在可以正常工作了.

修改: 今天,我偶然发现了在ForALL的Parallel中发布的帖子很好地解释了如何使用随机播放,投票和popc来过滤有助于主要结果的线程.

由于我以前没有这么做,所以我希望此编辑可以帮助某人找到它.

此致

I'm having some issues while trying to save some shared memory to improve the processing. In this code every thread pick a template index (t_initdwl) from host and expand it until the upper limit before the next kernel process it.

The idea is to use shuffle + ballot to "reserve" the right amount of space to write only indexes from threads that did not exceeded the upper limit. Since it is an array of limits, first every col limit is tested then it is written. The number of "expanded entries" varies according to the values in the initial index but, but once the thread surpass the limit, any further increment are not valid, so it sets play=false (just to be safe) and returns.

The issue is the bCol value that is always trasnp_line + 1, giving the idea that popc is not working correctly in this line only, since 1 is the correct value for the lineID 0 only. I'm not getting any error,

The expected results is the correct position to the line to write, taking in account the number of lines that still "playing" (have not returned yet).

__global__ void dwLgen(const int maxthreads, short* __restrict__ kpL, int* nkpl, 
        const short* __restrict__ init_dwL, const short rloops){
    __shared__ short t_initdwl[1024][24];

    const int warpID(threadIdx.x / warpSize);
    const int laneID(threadIdx.x % warpSize);
    const int st(rloops + 2);
    const int stb((p - kpMax + 1 + 1) * BUFFERSIZE_MAX); 
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int cAlive, tAlive, bCol, bline, transp_line;  
    int i, j; 
    bool volatile play(true);

    if (idx < maxthreads) {
        for (j = 0; j < st; j++){
            t_initdwl[threadIdx.x][j] = init_dwL[idx + j * BUFFERSIZE_MAX];
        }
        __syncthreads();
        for (i = 0; i < p; i++) {
            for (j = 0; j < st; j++)
                if ((t_initdwl[threadIdx.x][j] + i) > dwLt[j]) {
                    play = false;
                    return;
                }

            cAlive = __ballot((play == true));
            tAlive = __popc(cAlive);
            bline = __ffs(cAlive) - 1;
            if (laneID == bline)
                transp_line = atomicAdd(nkpl, tAlive);
            transp_line = __shfl(transp_line, bline);

            tAlive = exp2f(laneID + 1); 
            bline = tAlive & cAlive; // the bline value is ok
            bCol = transp_line + __popc(bline); // but __popc(bline) become always 1


            for (j = 0; j < st; j++)
                kpL[bCol + j * stb] = t_initdwl[threadIdx.x][j] + i;
        }
    }
}

thanks in advance!

解决方案

thanks for all the support!

After reading the @njuffa and test a code variation it work.

The previous approach to bitmask the laneID was incorrect.
Using exp2f only produced the expected result for the 1st and last lane.

tAlive = exp2f(laneID + 1)

was replaced by:

tAlive = 0xffffffff >> (warpSize - laneID);

So, now it works, just fine.

Edit: Today I found by a chance a post in Parallel for ForALL, that explain very well how to use shuffle, ballot and popc in order to filter threads that contribute to the main result.

I hope that this edit help someone to find it, since I did not in the past.

Regards,

这篇关于仅使用随机播放+选票+ popc从经纱中的选定车道写入数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆