The Floyd-Warshall algorithm in CUDA


Question


This is the sequential piece of code I am trying to parallelize in CUDA:

    /*
        Sequential (Single Thread) APSP on CPU.
    */
    void floyd_sequential(int *mat, const size_t N)
    {
        for(int k = 0; k < N; k ++)
            for(int i = 0; i < N; i ++)
                for(int j = 0; j < N; j ++)
                {
                    int i0 = i*N + j;
                    int i1 = i*N + k;
                    int i2 = k*N + j;
                    if(mat[i1] != -1 && mat[i2] != -1)
                        mat[i0] = (mat[i0] != -1 && mat[i0] < mat[i1] + mat[i2]) ?
                          mat[i0] : (mat[i1] + mat[i2]);
                }
    }
    

This is my CUDA implementation:

    // ParallelComputing.cpp : Defines the entry point for the console application.
    //
    
    #include <stdio.h>
    #include <cuda.h>
    #include <stdlib.h>
    
    
    #define DIMENSION 10;
    __global__ void gpu_Floyd(int *result, int N)
    {
        int j,k;
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
    
        for(k = 0; k < N; k++)
        {
            for(j = 0; j < N; j++)
            {
                int i0 = Row * N + j;  
                int i1 = Row * N + k;
                int i2 = k * N + j;
                if(result[i0] != -1 && result[i2] != -1)
                        result[i0] = (result[i0] != -1 && result[i0] < result[i1] + result[i2]) ?
                          result[i0] : (result[i1] + result[i2]);
                __syncthreads();
            }
        }
    }
    
    void GenMatrix(int *mat, const size_t N)
    {
        for(int i = 0; i < N*N; i ++)
            mat[i] = rand()%32 - 1;
    
    }
    
    bool CmpArray(const int *l, const int *r, const size_t eleNum)
    {
        for(int i = 0; i < eleNum; i ++)
            if(l[i] != r[i])
            {
                printf("ERROR: l[%d] = %d, r[%d] = %d\n", i, l[i], i, r[i]);
                return false;
            }
        return true;
    }
    
    int main(int argc, char **argv)
    {
    
    // generate a random matrix.
    size_t N = 10;
    int *mat = (int*)malloc(sizeof(int)*N*N);
    GenMatrix(mat, N);
    
    // compute the reference result.
    int *ref = (int*)malloc(sizeof(int)*N*N);
    memcpy(ref, mat, sizeof(int)*N*N);
    Floyd_sequential(ref, N);
    
    
    //CUDA Portion
    int Grid_Dim_x = 1, Grid_Dim_y = 1;
    int noThreads_x, noThreads_y;
    int *result = (int*)malloc(sizeof(int)*N*N);
    memcpy(result, mat, sizeof(int)*N*N);
    int *d_result;
    
    // compute your results
    
    cudaMalloc((void **)&d_result, N*N);
    
    cudaMemcpy(result, N * N, cudaMemcpyHostToDevice);
    gpu_Floyd<<<1024, 256>>>(d_result, N);
    cudaMemcpy(result, d_result, cudaMemcpyDeviceToHost);
    
    // compare your result with reference result
    if(CmpArray(result, ref, N*N))
        printf("The matrix matches.\n");
    else
        printf("The matrix do not match.\n");
    
    free(ref);
    free(result);
    cudaFree(d_result);
    }
    

However, my output always shows the matrices do not match.

I understand that in CUDA we try to map each element in the matrix to each row. However, I am trying to explore possibilities by mapping each row of the matrix to a thread instead.

Solution

As has already been mentioned, your provided GPU code does not compile, so I'm curious how you got to the observation that your output matrices do not match. Here are some of the problems with your code:

• cudaMalloc, just like malloc, allocates bytes, so this is not correct:

      cudaMalloc((void **)&d_result, N*N);
      

instead you want this:

        cudaMalloc((void **)&d_result, N*N*sizeof(int));
    

• likewise cudaMemcpy, just like memcpy, operates on bytes; furthermore, cudaMemcpy requires 4 parameters, so this is not correct:

      cudaMemcpy(result, N * N, cudaMemcpyHostToDevice);
      

instead you probably want this:

        cudaMemcpy(d_result, result, N * N*sizeof(int), cudaMemcpyHostToDevice);
    

and your other cudaMemcpy line needs to be fixed similarly.

• I'd also advise doing proper CUDA error checking
• Your kernel is written as if it expects a 2-dimensional thread array, or at least one-dimensional in y, whereas you are launching a one-dimensional grid in x:

      gpu_Floyd<<<1024, 256>>>(d_result, N);
      

therefore all your kernel built-in variables in y will always be 1 or 0, and this line of code:

            int Row = blockIdx.y * blockDim.y + threadIdx.y;
    

will evaluate to zero for all threads in your 1-D grid in x.

• Your GPU kernel is putting the results in the same matrix as the input data. For sequential code this may or may not matter, but for code that is intended to run in parallel, it can often lead to race conditions, because the order of operations (i.e. the order of thread execution) is largely undefined.
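Putting these fixes together, one workable structure is to move the k loop to the host and launch one kernel per step, with one thread per matrix element. The kernel-launch boundary then provides the grid-wide synchronization between k iterations that `__syncthreads()` cannot, since `__syncthreads()` only synchronizes threads within a single block. The sketch below is hypothetical (the names `floyd_step` and `gpu_floyd` and the 16x16 block size are my own choices, not from the original post):

```cuda
// Sketch: one relaxation step of Floyd-Warshall for a fixed k.
// -1 encodes "no edge", as in the original code.
__global__ void floyd_step(int *mat, int N, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i >= N || j >= N) return;                    // guard for partial blocks

    int i0 = i * N + j;   // element being relaxed
    int i1 = i * N + k;   // path i -> k
    int i2 = k * N + j;   // path k -> j
    if (mat[i1] != -1 && mat[i2] != -1) {
        int via_k = mat[i1] + mat[i2];
        if (mat[i0] == -1 || via_k < mat[i0])
            mat[i0] = via_k;
    }
}

// Host side: the k loop lives here; each launch is a global sync point.
void gpu_floyd(int *d_mat, int N)
{
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    for (int k = 0; k < N; ++k)
        floyd_step<<<grid, block>>>(d_mat, N, k);
    cudaDeviceSynchronize();
}
```

The host-side allocation and copies must still use `N * N * sizeof(int)` and the full 4-argument cudaMemcpy as described in the bullets above.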
