Error: BFS on CUDA Synchronization


Problem Description

My code below gets an error when it runs: some of the graph weights are being overwritten, but that should not be happening given the Xa array (which keeps track of which nodes have already been visited) and the __syncthreads() function... Can someone help?

struct Node 
{
    int begin;  // beginning of this node's edge sub-array
    int num;    // number of edges in the sub-array
};

__global__ void BFS (Node *Va, int *Ea, bool *Fa, bool *Xa, int *Ca, bool *parada)
{
    int tid = threadIdx.x;

    if (Fa[tid] == true && Xa[tid] == false)
    {
        Fa[tid] = false; 
        __syncthreads();

        // Va[tid].begin is where this node's edge sub-array starts;
        // Va[tid].num is the number of edges in it
        for (int i = Va[tid].begin;  i < (Va[tid].begin + Va[tid].num); i++) 
        {           
            int nid = Ea[i];

            if (Xa[nid] == false)
            {
                Ca[nid] = Ca[tid] + 1;
                Fa[nid] = true;
                *parada = true;
            }   
        }    
        Xa[tid] = true;             
    }
}

// The BFS frontier corresponds to all the nodes being processed 
// at the current level.
int main()
{

    // graph description
    struct Node node[4]; 
    node[0].begin=0; 
    node[0].num=2; 
    node[1].begin=1; 
    node[1].num=0; 
    node[2].begin=2; 
    node[2].num=2; 
    node[3].begin=1; 
    node[3].num=0; 
    int edges[]={1,2,3,1}; 

    bool frontier[4]={false}; 
    bool visited[4]={false}; 
    int custo[4]={0}; 

    int source=0; 
    frontier[source]=true; 

    Node* Va; 
    cudaMalloc((void**)&Va,sizeof(Node)*4); 
    cudaMemcpy(Va,node,sizeof(Node)*4,cudaMemcpyHostToDevice); 

    int* Ea; 
    cudaMalloc((void**)&Ea,sizeof(int)*4); 
    cudaMemcpy(Ea,edges,sizeof(int)*4,cudaMemcpyHostToDevice); 

    bool* Fa; 
    cudaMalloc((void**)&Fa,sizeof(bool)*4); 
    cudaMemcpy(Fa,frontier,sizeof(bool)*4,cudaMemcpyHostToDevice); 

    bool* Xa; 
    cudaMalloc((void**)&Xa,sizeof(bool)*4); 
    cudaMemcpy(Xa,visited,sizeof(bool)*4,cudaMemcpyHostToDevice); 

    int* Ca; 
    cudaMalloc((void**)&Ca,sizeof(int)*4); 
    cudaMemcpy(Ca,custo,sizeof(int)*4,cudaMemcpyHostToDevice); 

    dim3 threads(4,1,1); 

    bool para; 
    bool* parada; 
    cudaMalloc((void**)&parada,sizeof(bool)); 
    printf("\n");
    int n=1;
    do{ 
        para=false; 
        cudaMemcpy(parada,&para,sizeof(bool),cudaMemcpyHostToDevice);       
        BFS <<<1,threads>>>(Va,Ea,Fa,Xa,Ca,parada);     
        CUT_CHECK_ERROR("kernel1 execution failed"); 
        cudaMemcpy(&para,parada,sizeof(bool),cudaMemcpyDeviceToHost); 



        printf("Run number: %d >> ",n); 
        cudaMemcpy(custo,Ca,sizeof(int)*4,cudaMemcpyDeviceToHost);  
        for(int i=0;i<4;i++) 
            printf("%d  ",custo[i]); 
        printf("\n");
        n++;

    }while(para); 


    printf("\nFinal:\n");
    cudaMemcpy(custo,Ca,sizeof(int)*4,cudaMemcpyDeviceToHost); 

    for(int i=0;i<4;i++) 
        printf("%d  ",custo[i]); 
    printf("\n");

}


Recommended Answer

There are a number of pretty major flaws in that device code. Firstly, you have memory races on both Xa and Ca. Secondly, you have a conditionally executed __syncthreads() call, which is illegal and can lead to the kernel hanging whenever the threads of a warp diverge around the call.
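A minimal sketch of one way to restructure the kernel so that the barrier is reached unconditionally by every thread of the block is shown below; it keeps the original parameter names and deliberately leaves the races on Xa and Ca in place, since those are a separate problem:

// Sketch only: the __syncthreads() barrier is now reached by every thread
// of the block; the memory races on Xa and Ca remain.
__global__ void BFS (Node *Va, int *Ea, bool *Fa, bool *Xa, int *Ca, bool *parada)
{
    int tid = threadIdx.x;
    bool active = (Fa[tid] && !Xa[tid]);   // is this node on the current frontier?

    if (active)
        Fa[tid] = false;                   // leave the frontier

    __syncthreads();                       // unconditional barrier

    if (active)
    {
        for (int i = Va[tid].begin; i < Va[tid].begin + Va[tid].num; i++)
        {
            int nid = Ea[i];
            if (!Xa[nid])
            {
                Ca[nid] = Ca[tid] + 1;     // still racy, see below
                Fa[nid] = true;            // add neighbour to the next frontier
                *parada = true;            // at least one node was expanded
            }
        }
        Xa[tid] = true;                    // mark this node visited
    }
}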

The structure of the algorithm you are using probably isn't going to be correct on CUDA, even if you were to use atomic memory access functions to eliminate the worst of the read-after-write races in the code as posted. Using atomic memory access will effectively serialise the code and cost a great deal of performance.
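To illustrate that trade-off, here is a hedged sketch of the neighbour-update step rewritten with atomicMin. It assumes the host initialises the cost array to INT_MAX for every node except the source (rather than all zeros, as in the posted code), and the helper name relax_neighbour is hypothetical, not part of the original code. Reading Ca[tid] non-atomically while other threads update it is still not fully race-free; this only removes the worst of the races, at the serialisation cost described above.

// Sketch only: atomicMin serialises conflicting writers to Ca[nid],
// giving a well-defined result at a performance cost.
// Assumes Ca[] was initialised to INT_MAX everywhere except the source.
__device__ void relax_neighbour(int nid, int myCost,
                                bool *Xa, bool *Fa, int *Ca, bool *parada)
{
    if (!Xa[nid])
    {
        atomicMin(&Ca[nid], myCost + 1);   // replaces the racy Ca[nid] = Ca[tid] + 1
        Fa[nid] = true;                    // benign: every writer stores true
        *parada = true;
    }
}

In the kernel sketch above, the body of the edge loop would then become a call such as relax_neighbour(nid, Ca[tid], Xa, Fa, Ca, parada).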

Breadth-first search on CUDA isn't an unsolved problem. There are a number of good papers on implementations, if you care to search for them. I would recommend High Performance and Scalable GPU Graph Traversal, if you have not already seen it. The code for those authors' implementation is also available for download from here.
