在CUDA中增加每个线程的工作的示例 [英] Example of increasing the work per thread in CUDA

查看:159
本文介绍了在CUDA中增加每个线程的工作的示例的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

简介:首先,作为介绍,我非常自豪提出我对StackOverflow的第一个问题。



算法



我正在使用CUDA编写程序,问题如下:




  • (n * 128)和B(m * 128)


  • 我取A的第一行,计算该向量和所有行之间的距离


  • 我将每个距离的结果写入矩阵C的行,因此元素C(i,j) C包含A的第i行和B的第j行之间的距离。


  • 我继续A的下一行。




我这样实现:我有一个由(n * m)个块组成的网格,每个块有128个线程。 (1 * 128)。



QUESTION :程序成功运行并带有预期结果,但时间执行速度只有大约5到10倍比它的单线程CPU版本。因此,我想知道如何在减少之前增加每个线程的工作以提高性能



内核代码:未优化)

  __global__ void EuclideanDistances(float * A,float * B,float * C,int n ,int m)
{
// SIZE等于128
__shared__ float accumResult [SIZE];
float sA;
float sB;

// MAPPING
int bx = blockIdx.x; // n
int by = blockIdx.y; // m
int ty = threadIdx.y; // 128
int tx = threadIdx.x; // 1


sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();


accumResult [ty] =(sA - sB)*(sA - sB);
__syncthreads();


//并行树缩减
for(int stride = SIZE / 2; stride> 0; stride>> = 1)
if ty< stride)
{
accumResult [ty] + = accumResult [stride + ty];
__syncthreads();
}

//将结果写入输出矩阵
if((threadIdx.y == 0))
C [bx * m + by] = accumResult [ ty];
__syncthreads();
}

UPDATE



现在,我使用另一个映射:而不是通过 m n 的网格>块和一个 128 线程的块,我增加一个块中的线程数,以减少块数。



新映射:



封锁 128 code> 8 线程(总共1024个线程,这是最大大小)



n / 8 m / 8



不幸的是,它的结果错误。



优化的内核代码(待更新)

  __ global__ void EuclideanDistances(float * A,float * B,float * C,int n,int m)
{
__shared__ float accumResult [SIZE] [8]
__shared__ float sA [SIZE] [8];
__shared__ float sB [SIZE] [8];

int bx = blockIdx.x; // n / 8
int by = blockIdx.y; // m / 8
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = bx * tx * SIZE + ty;
int j = by * tx * SIZE + ty;

sA [ty] [tx] = A [i];
sB [ty] [tx] = B [j];
__syncthreads();


accumResult [ty] [tx] =(sA [ty] [tx] - sB [ty] [tx])*(sA [ty] [tx] - sB ] [tx]);
__syncthreads();

//减少
for(int stride = SIZE / 2; stride> 0; stride>> = 1)
if(ty< stride)
{
accumResult [ty] [tx] + = accumResult [stride + ty] [tx];
__syncthreads();
}

C [bx * m + by] = accumResult [0] [tx];
}

HOST CODE(分配+内核调用) / p>

  int main()
{
int m = 20000; // MatrixA size:m * SIZE
int n = 4000; // MatrixB size:n * SIZE

srand((unsigned)time(0));

//主机分配
float * matrixA =(float *)malloc(n * SIZE * sizeof(float));
for(int i = 0; i matrixA [i] =(float)(rand()%100)+1;

float * matrixB =(float *)malloc(m * SIZE * sizeof(float));
for(int i = 0; i matrixB [i] =(float)(rand()%100)+1;

float * results_kernel1 =(float *)malloc(n * m * sizeof(float));
float * results_kernel2 =(float *)malloc(n * m * sizeof(float));


//设备分配
float * d_matrixA;
float * d_matrixB;
cudaMalloc((void **)& d_matrixA,n * SIZE * sizeof(float));
cudaMalloc((void **)& d_matrixB,m * SIZE * sizeof(float));
cudaMemcpy(d_matrixA,matrixA,n * SIZE * sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_matrixB,matrixB,m * SIZE * sizeof(float),cudaMemcpyHostToDevice);

float * d_results_kernel1;
float * d_results_kernel2;
cudaMalloc((void **)& d_results_kernel1,n * m * sizeof(float));
cudaMalloc((void **)& d_results_kernel2,n * m * sizeof(float));

dim3 threads1(1,128);
dim3 blocks1(n,m);
EuclideanDistances1<<<< blocks1,threads1>>> (d_matrixA,d_matrixB,d_results_kernel1,n,m);
cudaDeviceSynchronize();
cudaMemcpy(results_kernel1,d_results_kernel1,n * m * sizeof(float),cudaMemcpyDeviceToHost);
cudaFree(d_results_kernel1);

dim3 threads2(8,128); //每个块1024个线程(最大)
dim3 blocks2(ceil((float)n / 8),ceil((float)m / 8)
EuclideanDistances2<<< block2,threads2>> (d_matrixA,d_matrixB,d_results_kernel2,n,m);
cudaDeviceSynchronize();
cudaMemcpy(results_kernel2,d_results_kernel2,n * m * sizeof(float),cudaMemcpyDeviceToHost);
cudaFree(d_results_kernel2);

//显示和比较结果
for(int i = 0; i <50; i ++)
std :: cout < kernel1:< results_kernel1 [i]<< | kernel2:< results_kernel2 [i]<< std :: endl;

free(matrixA);
free(matrixB);
免费(results_kernel1);
free(results_kernel2);

return 0;
}

PS :我有CUDA 6.0

解决方案

似乎您的问题有两个组件:


  1. 为什么我的第二个内核不工作?

  2. 如何让我的代码运行得更快? li>




为什么我的第二个内核不工作?


您有几个问题:


  1. 在初始计算 i时索引问题, j 以及用于存储 C 值的索引。
  2. 违反使用情况

项目1的原因是: _syncthreads()


如何使我的代码运行速度更快?


< blockquote>

这更涉及。首先,你尝试增加每个线程的工作量没有做任何事情,这只是每个块的线程数量增加(从128到8 * 128)。每个线程都做了大致相同的工作量。此外,在进行2D线程块的这个尝试的过程中,我相信一些坏事发生了:


  1. 各种合并和共享 - 内存冲突加载和存储模式已损坏。

  2. 有效占用率下降,由于每个块需要共享内存量。

第二个内核的净效果是大约使执行时间加倍。所以这不是我们想要的。



然而,增加每个线程的工作可能是一个好主意,使用共享内存,以及尝试保持良好,共享)存储器访问模式,以及允许增加占用率。



以下是沿着这些线的工作进展。以下代码将您的第二个内核,定时基础结构,以及完整的数据验证,以及2个新的内核固定。第一个新内核(#3)是我所谓的天真内核。它简单地为每个输出点分配一个线程,并且每个线程循环通过必要的向量,计算其单独的结果。没有使用共享内存,甚至更加注意合并或任何其他优化。但是对于threadblock配置(16,16) - >(8,32)线程,我从@talonmies回答(现在删除)观察到,这个内核执行显着(3x)比你的快速内核快。在进一步思考(8,32)观察后,我得出结论,下一次优化尝试应该集中在:


  1. 消除使用

  2. 从高速缓存中获益的最大化
  3. $ b()允许相邻线程使用直接for循环来遍历向量。 $ b
  4. 高效使用共享内存

  5. 坚持完美的全局合并/完美使用共享内存,用于所有读写操作

项目4提示了我可以转置矩阵吗?使用此权限,可以重新组织数据,以便于上述第4项。上面的项2通过将B向量加载到共享存储器中而在我的快速内核(#4)中寻址,同时允许高速缓存主要集中在缓存A向量上,希望减少高速缓存 - 颠簸(A是2矢量阵列,大约2MB - 费米L2是768K,开普勒L2是1.5MB)。通过以转置形式传递A,并且从共享存储器有效地转置B,可以使用直接for循环来计算向量距离,同时允许相邻线程具有完美的合并读取和写入,以及



对于我的特定时序,(Quadro5000 cc2.0 GPU,CUDA 6),高效使用共享内存(即非银行冲突负载和广播读取) ,RHEL 5.5)我看到你的快速内核需要大约2秒,我的幼稚内核需要大约0.7秒,我的快速内核需要大约0.2秒,虽然有转置(A,C)数据。



EDIT:我进行了一次额外的优化,即每个块计算多个( CHKSIZE )一次B个向量。您可以将CHKSIZE设置为1,以查看以前的结果(约0.2秒)。我发现CHKSIZE的4有很好的改善。这是试图利用A的数据重用的攻击。通过CHKSIZE为4的额外优化,内核4的内核时间下降到大约0.1秒。



以下是代码和运行示例:

  $ cat t460.cu 
#include< stdio。 h。
#include< stdlib.h>
#include< iostream>

// M和N都必须可以被SIZE整除,M必须可以被CHKSIZE整除
#define SIZE 128
#define N 4000
#define M 20000
#define CHKSIZE 4

__global__ void EuclideanDistances1(float * A,float * B,float * C,int n,int m)
{
/ / SIZE等于128
__shared__ float accumResult [SIZE];
float sA;
float sB;

// MAPPING
int bx = blockIdx.x; // n
int by = blockIdx.y; // m
int ty = threadIdx.y; // 128
// int tx = threadIdx.x; // 1

sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();

accumResult [ty] =(sA - sB)*(sA - sB);
__syncthreads();

//并行树缩减
for(int stride = SIZE / 2; stride> 0; stride>> = 1){
if(ty < stride)
{
accumResult [ty] + = accumResult [stride + ty];
}
__syncthreads();
}

//将结果写入输出矩阵
if((ty == 0))
C [bx * m + by] = accumResult [ty] ;
__syncthreads();
}

__global__ void EuclideanDistances2(float * A,float * B,float * C,int n,int m)
{
__shared__ float accumResult [SIZE] [8];
__shared__ float sA [SIZE] [8];
__shared__ float sB [SIZE] [8];

int bx = blockIdx.x; // n / 8
int by = blockIdx.y; // m
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i =((bx * 8)+ tx)* SIZE + ty;
int j = by * SIZE + ty;

sA [ty] [tx] = A [i];
sB [ty] [tx] = B [j];
__syncthreads();

accumResult [ty] [tx] =(sA [ty] [tx] - sB [ty] [tx] );
__syncthreads();

//减少
for(int stride = SIZE / 2; stride> 0; stride>> = 1){
if(ty< stride)
{
accumResult [ty] [tx] + = accumResult [stride + ty] [tx];
}
__syncthreads();
}

if(ty == 0)
C [((bx * 8)+ tx)* m + by] = accumResult [0] [tx]
}
// naive kernel
__global__ void EuclideanDistances3(float * A,float * B,float * C,int n,int m){
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int idy = threadIdx.y + blockDim.y * blockIdx.y;
float result = 0.0f;

if((idx for(int i = 0; i float temp = A [(idx * SIZE)+ i] -B [(idy * SIZE)+ i];
result + = temp * temp;}
C [(idx * m)+ idy] = result;
}
}
//优化的内核
__global__ void EuclideanDistances4(const float * A,const float * B,float * C,const int n,const int m){
// n,A,4000这个内核假设A是列主语A(SIZE,n)
// m,B,20000这个内核假设B是行主B b $ b //这个内核假设C是列主元C(m,n)
//这个内核假设每个线程块的线程数量== SIZE
// CHKSIZE是B向量的数量将计算每个块
__shared__ float my_sB [CHKSIZE * SIZE]; //对于CHKSIZE向量的足够的共享存储B
int bx = blockIdx.x; //每个CHKSIZE行的一个块(更大的输入矩阵)
while((bx * CHKSIZE) for(int i = 0; i my_sB [(i * SIZE)+ tx] = B [((bx * CHKSIZE )+ i)* SIZE)+ tx];
__syncthreads();
while(tx float result [CHKSIZE];
for(int i = 0; i result [i] = 0.0f;
for(int i = 0; i float Atemp = A [(n * i)+ tx];
for(int j = 0; j float temp = Atemp-my_sB [i +(j * SIZE)];
result [j] + = temp * temp;}}
for(int i = 0; i C [((i + * CHKSIZE))* n)+ tx] = result [i];
tx + = blockDim.x; } //继续循环遍历A
__syncthreads()中的向量; //如果使用块循环,则必须防止翘曲前进
bx + = gridDim.x;}
}

float comp_euclid_sq(const float * rA,const float * rB,const int size){

float result = 0.0f;
float temp;
for(int i = 0; i temp =(rA [i] -rB [i]);
result + = temp * temp;}
return result;
}

int main()
{
float et1 = 0.0f,et2 = 0.0f,et3 = 0.0f,et4 = 0.0f;
cudaEvent_t start1,start2,start3,start4,stop1,stop2,stop3,stop4;
cudaEventCreate(& start1);
cudaEventCreate(& start2);
cudaEventCreate(& start3);
cudaEventCreate(& start4);
cudaEventCreate(& stop1);
cudaEventCreate(& stop2);
cudaEventCreate(& stop3);
cudaEventCreate(& stop4);

int n = N; // MatrixA size:n * SIZE
int m = M; // MatrixB size:m * SIZE

srand((unsigned)time(0));

//主机分配
float * matrixA =(float *)malloc(n * SIZE * sizeof(float));
for(int i = 0; i matrixA [i] =(float)(rand()%100)+1;

float * matrixB =(float *)malloc(m * SIZE * sizeof(float));
for(int i = 0; i matrixB [i] =(float)(rand()%100)+1;

float * results_kernel =(float *)malloc(n * m * sizeof(float));
float * cpu_results_kernel =(float *)malloc(n * m * sizeof(float));
for(int i = 0; i cpu_results_kernel [i] = comp_euclid_sq(matrixA +((i / m)* SIZE),matrixB + SIZE,SIZE);

//设备分配
float * d_matrixA;
float * d_matrixB;
cudaMalloc((void **)& d_matrixA,n * SIZE * sizeof(float));
cudaMalloc((void **)& d_matrixB,m * SIZE * sizeof(float));
cudaMemcpy(d_matrixA,matrixA,n * SIZE * sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_matrixB,matrixB,m * SIZE * sizeof(float),cudaMemcpyHostToDevice);

float * d_results_kernel;
cudaMalloc((void **)& d_results_kernel,n * m * sizeof(float));


dim3 threads1(1,SIZE);
dim3 blocks1(n,m); $ b $ c cudaEventRecord(start1);
EuclideanDistances1<<<< blocks1,threads1>>> (d_matrixA,d_matrixB,d_results_kernel,n,m);
cudaEventRecord(stop1);
cudaMemcpy(results_kernel,d_results_kernel,n * m * sizeof(float),cudaMemcpyDeviceToHost);
for(int i = 0; i if(results_kernel [i]!= cpu_results_kernel [i]){printf(cpu / kernel1 mismatch at%d,cpu :%f,kernel1:%f \\\
,i,cpu_results_kernel [i],results_kernel [i]); return 1;}}
cudaMemset(d_results_kernel,0,n * m * sizeof(float));
cudaEventSynchronize(stop1);
cudaEventElapsedTime(& et1,start1,stop1);

dim3 threads2(8,SIZE); //每个块1024个线程(最大)
dim3 blocks2(n / 8,m); //假设n可被8整除
cudaEventRecord(start2);
EuclideanDistances2<<< block2,threads2>> (d_matrixA,d_matrixB,d_results_kernel,n,m);
cudaEventRecord(stop2);
cudaMemcpy(results_kernel,d_results_kernel,n * m * sizeof(float),cudaMemcpyDeviceToHost);
for(int i = 0; i if(results_kernel [i]!= cpu_results_kernel [i]){printf(cpu / kernel2 mismatch at%d,cpu :%f,kernel1:%f \\\
,i,cpu_results_kernel [i],results_kernel [i]); return 1;}}
cudaMemset(d_results_kernel,0,n * m * sizeof(float));
cudaEventSynchronize(stop2);
cudaEventElapsedTime(& et2,start2,stop2);

cudaFuncSetCacheConfig(EuclideanDistances3,cudaFuncCachePreferL1);
dim3 threads3(8,32); //每个块1024个线程(最大)
dim3 blocks3(n / threads3.x,m / threads3.y); //假设可以整除
cudaEventRecord(start3);
EuclideanDistances3<<< blocks3,threads3>> (d_matrix A,d_matrixB,d_results_kernel,n,m);
cudaEventRecord(stop3);
cudaMemcpy(results_kernel,d_results_kernel,n * m * sizeof(float),cudaMemcpyDeviceToHost);
for(int i = 0; i if(results_kernel [i]!= cpu_results_kernel [i]){printf(cpu / kernel3 mismatch at%d,cpu :%f,kernel3:%f \\\
,i,cpu_results_kernel [i],results_kernel [i]); return 1;}}
cudaMemset(d_results_kernel,0,n * m * sizeof(float));
cudaEventSynchronize(stop3);
cudaEventElapsedTime(& et3,start3,stop3);

//转置矩阵A
float * matrixA_T =(float *)malloc(n * SIZE * sizeof(float));
for(int i = 0; i for(int j = 0; j matrixA_T [(j * n)+ i] = matrixA [(i * SIZE)+ j];
cudaMemcpy(d_matrixA,matrixA_T,n * SIZE * sizeof(float),cudaMemcpyHostToDevice);

cudaFuncSetCacheConfig(EuclideanDistances4,cudaFuncCachePreferL1);
dim3 threads4(SIZE); //每个向量元素一个线程
dim3 blocks4(m / CHKSIZE);
cudaEventRecord(start4);
EuclideanDistances4<<< blocks4,threads4>> (d_matrixA,d_matrixB,d_results_kernel,n,m);
cudaEventRecord(stop4);
cudaMemcpy(results_kernel,d_results_kernel,n * m * sizeof(float),cudaMemcpyDeviceToHost);
//测试正确的转置结果C(m,n)
for(int i = 0; i for(int j = 0; j < (cpu / kernel4不匹配在%d,%d,cpu:%f(*)) ,kernel4:%f \\\
,i,j,cpu_results_kernel [(i * m)+ j],results_kernel [(j * n)+ i] return 1;}
cudaEventSynchronize(stop4);
cudaEventElapsedTime(& et4,start4,stop4);
cudaFree(d_results_kernel);

printf(Success!\\\
);
printf(kernel1:%.fms,kernel2:%.fms,kernel3:%.fms,kernel4:%.fms\\\
,et1,et2,et3,et4)

free(matrixA);
free(matrixB);
免费(results_kernel);

return 0;
}

$ nvcc -O3 -arch = sm_20 -o t460 t460.cu
$ ./t460
成功!
kernel1:2213ms,kernel2:4660ms,kernel3:691ms,kernel4:99ms
$



希望这将让你去更多的想法的东西工作。您可能在cc3.0设备上获得不同的时间。



是否可以进一步优化?大概。我将看到的第一个目标是找出如何利用向量A上的数据重用机会(向量B的​​数据重用已经通过将其加载到共享存储器中而在内核4中被处理)是使用一些共享内存来存储A的部分以使代码运行得更快的方法。)



我想我还应该提到,只要此代码计算欧氏距离正方形。对内核的微小修改可以使它计算实际的欧几里得距离( C [...] = sqrtf(...); )我已经包括验证,但是,假设结果是范围内的,用于在 float 中完整存储整数数量。您的测试用例满足此要求,否则验证代码需要修改(如果使用 sqrtf )。


Intro : First and as an introduction, i am quite "proud" to ask my first question on StackOverflow. I hope I'll be able to help other people as much as they help me.

Algorithm :

I'm writing a program with CUDA and the problem is the following:

  • Two matrices A (n * 128) and B (m * 128)

  • I take the first row of A, and I compute the distance between that vector and all the rows of B, one by one.

  • I write the result of each distance on a row of a matrix C, so the element C(i,j) of C contains the distance between row i of A and row j of B.

  • and I proceed with the next row of A.

I've implemented it this way: I've got a grid made by ( n * m ) blocks, and 128 threads per block. ( 1 * 128 ).

QUESTION: The program runs successfully with the expected results but the time execution is only around 5 to 10 times faster than the one-threaded CPU version of it. So I would like to know how to increase the work per thread before reduction in order to increase performance.

Kernel code (original : Not optimized)

 __global__ void EuclideanDistances( float *A, float *B , float *C , int n , int m)
{
    // SIZE is equal to 128
__shared__ float accumResult[SIZE];
float sA;
float sB;

    // MAPPING
int bx = blockIdx.x;  // n
int by = blockIdx.y;  // m
int ty = threadIdx.y; // 128
int tx = threadIdx.x; // 1


sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();


accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();


// Parallel tree-reduction
for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1)
    if (ty < stride)
    {
        accumResult[ty] += accumResult [stride + ty];
          __syncthreads();
    }

    // Writing results to output matrix
if ((threadIdx.y == 0))
    C [bx * m + by] = accumResult[ty];
       __syncthreads();
}

UPDATE

Now, I'm using another mapping : Instead of taking a grid of n by m blocks and a block of 128 threads, I'm increasing the number of threads within a block in order to decrease the number of blocks.

New mapping:

Block of 128 by 8 threads (total of 1024 threads, which is the max size)

Grid of n/8 by m/8 blocks

Unfortunately, it's giving wrong results ).

Optimized kernel code (to be updated)

__global__ void EuclideanDistances( float *A, float *B , float *C, int n , int m)
{
    __shared__ float accumResult[SIZE][8];
__shared__ float sA[SIZE][8];
__shared__ float sB[SIZE][8];

int bx = blockIdx.x;  // n / 8
int by = blockIdx.y;  // m / 8
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = bx * tx * SIZE + ty;
int j = by * tx * SIZE + ty;

sA[ty][tx] = A [i];
sB[ty][tx] = B[j];
__syncthreads();


accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
__syncthreads();

// Reduction
for (int stride = SIZE/2 ; stride > 0 ; stride>>=1)
    if (ty < stride)
    {
        accumResult[ty][tx] += accumResult [stride + ty][tx];
        __syncthreads();
    }

    C[bx *  m + by] = accumResult[0][tx];
}

HOST CODE (allocations + kernel calls)

    int main()
{
     int m = 20000; //MatrixA size : m * SIZE
     int n = 4000;  //MatrixB size : n * SIZE

     srand((unsigned)time(0));

     // Host Allocations
     float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
     for(int i=0; i < n * SIZE; i++)
         matrixA[i] = (float) (rand()%100)+1;

     float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
     for(int i=0; i < m * SIZE; i++)
         matrixB[i] = (float) (rand()%100)+1;

     float *results_kernel1 = (float *) malloc (n * m * sizeof(float));
     float *results_kernel2 = (float *) malloc (n * m * sizeof(float));


     //Device Allocation
     float *d_matrixA;
     float *d_matrixB;
     cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
     cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
     cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
     cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);

     float *d_results_kernel1;
     float *d_results_kernel2;
     cudaMalloc((void **)&d_results_kernel1 , n * m * sizeof(float));
     cudaMalloc((void **)&d_results_kernel2 , n * m * sizeof(float));

     dim3 threads1 (1 , 128);
     dim3 blocks1  (n , m);
     EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel1 , n , m);
     cudaDeviceSynchronize();
     cudaMemcpy(results_kernel1 , d_results_kernel1 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
     cudaFree(d_results_kernel1);

     dim3 threads2 (8 , 128);   // 1024 threads per block (maximum)
     dim3 blocks2  (ceil((float)n/8) , ceil((float)m/8));
     EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel2 , n , m);
     cudaDeviceSynchronize();
     cudaMemcpy(results_kernel2 , d_results_kernel2 , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
     cudaFree(d_results_kernel2);

     // Visualising and comparing results
     for (int i = 0 ; i < 50 ; i++)
         std::cout << "kernel1 : " << results_kernel1[i] << "  |  kernel2 : " << results_kernel2[i] << std::endl;

     free(matrixA);
     free(matrixB);
     free(results_kernel1);
     free(results_kernel2);

     return 0;
}

PS: I have CUDA 6.0 with a NVIDIA GTX 650 (compute capability 3.0)

解决方案

It seems your question has 2 components:

  1. why isn't my second kernel working?
  2. how do I make my code run faster?

Why isn't my second kernel working?

You had several issues:

  1. indexing problems in initial calculation of i, j as well as the index for storing the C value.
  2. violation of usage of _syncthreads() inside a conditional block

item 1 was the key element to get the code working.

How do I make my code run faster?

This is more involved. First of all, your attempt at "increasing work per thread" didn't do anything of the kind, it was merely an increase in the number of threads per block (from 128 to 8*128). Each thread was doing approximately the same amount of work. Furthermore, in the process of going to a 2D threadblock for this attempt, I believe a couple of bad things happened:

  1. various coalescing and shared-memory-bank-conflict load and store patterns were broken.
  2. effective occupancy went down, due the amount of shared memory required per block.

The net effect of the second kernel was to approximately double the execution time. So that is not what we want.

However, increasing work per thread may be a good idea, along with using shared memory, as well as trying to preserve good (global, shared) memory access patterns, as well as allowing for increased occupancy.

What follows is a work-in-progress along those lines. The following code has your second kernel fixed, along with timing infrastructure, as well as full data verification, as well as 2 new kernels. The first new kernel (#3) is what I would call a "naive" kernel. It simply allocates one thread per output point, and each thread loops through the necessary vectors, computing its individual result. No usage of shared memory, or even much attention to coalescing or any other optimization. However with a tweak to threadblock configuration (16,16) -> (8,32) threads, which I observed from @talonmies answer (now deleted), this kernel performs significantly (3x) faster than your "fast" kernel. After further thought about the (8,32) observation, I concluded that the next attempt at optimization should focus on:

  1. elimination of the usage of a parallel reduction to compute the vector distance (i.e. allow adjacent threads to use a straight for-loop to loop through the vectors)
  2. maximization of benefit from the cache
  3. efficient usage of shared memory
  4. insist on perfect global coalescing/perfect usage of shared memory for all reads and writes

Item 4 prompted the question in the comments "may I transpose the matrices?" With this permission, it's possible to re-organize the data to facilitate item 4 above. Item 2 above is addressed in my "fast" kernel (#4) by loading the B vector into shared memory, while allowing the cache to mostly focus on caching the A vectors, hopefully reducing cache-thrashing (A is the smaller of the 2 vector arrays, at about 2MB - fermi L2 is 768K, Kepler L2 is 1.5MB). By delivering A in transposed form, and effectively "transposing" B on-chip from shared memory, it's possible to use a straight for-loop to compute the vector distance, while allowing adjacent threads to have perfectly coalesced reads and writes, as well as "efficient" use of shared memory (i.e. non-bank-conflicted loads, and broadcast reads).

For my particular timing, (Quadro5000 cc2.0 GPU, CUDA 6, RHEL 5.5) I see that your "fast" kernel requires about 2 seconds, my "naive" kernel requires about 0.7 seconds, and my "fast" kernel requires about 0.2 seconds, albeit with transposed (A,C) data.

EDIT: I've made one additional optimization, that is to have each block compute multiple (CHKSIZE) B vectors at one time. You can set CHKSIZE to 1 to see the previous result (~0.2sec). I found CHKSIZE of 4 gave good improvement. This is an attack at attempting to exploit the data re-use of A. With this additional optimization at CHKSIZE of 4, the kernel time for kernel 4 drops to about 0.1 second.

Following is the code and a sample run:

$ cat t460.cu 
#include <stdio.h>
#include <stdlib.h>
#include <iostream>

// both M and N must be evenly divisible by SIZE, M must be evenly divisible by CHKSIZE
#define SIZE 128
#define N 4000
#define M 20000
#define CHKSIZE 4

 __global__ void EuclideanDistances1( float *A, float *B , float *C , int n , int m)
{
    // SIZE is equal to 128
__shared__ float accumResult[SIZE];
float sA;
float sB;

    // MAPPING
int bx = blockIdx.x;  // n
int by = blockIdx.y;  // m
int ty = threadIdx.y; // 128
//int tx = threadIdx.x; // 1

sA = A [bx * SIZE + ty];
sB = B [by * SIZE + ty];
__syncthreads();

accumResult[ty] = (sA - sB) * (sA - sB);
__syncthreads();

// Parallel tree-reduction
for (int stride = SIZE/2 ; stride > 0 ; stride >>= 1){
    if (ty < stride)
    {
        accumResult[ty] += accumResult [stride + ty];
    }
          __syncthreads();
  }

    // Writing results to output matrix
if ((ty == 0))
    C [bx * m + by] = accumResult[ty];
       __syncthreads();
}

__global__ void EuclideanDistances2( float *A, float *B , float *C, int n , int m)
{
__shared__ float accumResult[SIZE][8];
__shared__ float sA[SIZE][8];
__shared__ float sB[SIZE][8];

int bx = blockIdx.x;  // n / 8
int by = blockIdx.y;  // m
int tx = threadIdx.x; // 8
int ty = threadIdx.y; // 128
int i = ((bx*8) + tx) * SIZE + ty;
int j = by * SIZE + ty;

sA[ty][tx] = A[i];
sB[ty][tx] = B[j];
__syncthreads();

accumResult[ty][tx] = (sA[ty][tx] - sB[ty][tx]) * (sA[ty][tx] - sB[ty][tx]);
__syncthreads();

// Reduction
for (int stride = SIZE/2 ; stride > 0 ; stride>>=1){
    if (ty < stride)
    {
        accumResult[ty][tx] += accumResult [stride + ty][tx];
    }
    __syncthreads();
  }

if (ty == 0)
    C[((bx*8)+tx) *  m + by] = accumResult[0][tx];
}
//naive kernel
__global__ void EuclideanDistances3( float *A, float *B , float *C, int n , int m){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int idy = threadIdx.y+blockDim.y*blockIdx.y;
  float result = 0.0f;

  if ((idx < n) && (idy < m)){
    for (int i = 0; i < SIZE; i++){
      float temp = A[(idx*SIZE)+i] - B[(idy*SIZE)+i];
      result += temp * temp;}
    C[(idx*m) + idy] = result;
  }
}
//optimized kernel
__global__ void EuclideanDistances4( const float *A, const float *B , float *C, const int n , const int m){
  // n, A,  4000 this kernel assumes A is column-major A(SIZE, n)
  // m, B, 20000 this kernel assumes B is row-major    B(m, SIZE)
  // this kernel assumes C is column-major             C(m,n)
  // this kernel assumes number of threads per threadblock == SIZE
  // CHKSIZE is the number of B vectors that will be compute per block
  __shared__ float my_sB[CHKSIZE*SIZE];  // enough shared storage for CHKSIZE vectors of B
  int bx  = blockIdx.x; // one block per CHKSIZE rows of B (the larger input matrix)
  while ((bx*CHKSIZE) < m){ // not used, this while loop could be used to extend a block to multiple chunks
    int tx  = threadIdx.x;
    for (int i = 0; i < CHKSIZE; i++)  // load vectors of B into shared memory
      my_sB[(i*SIZE)+tx] = B[(((bx*CHKSIZE)+i)*SIZE)+tx];
    __syncthreads();
    while (tx < n){  //loop across all vectors in A
      float result[CHKSIZE];
      for (int i = 0; i < CHKSIZE; i++)
        result[i] = 0.0f;
      for (int i = 0; i < SIZE; i++){
        float Atemp = A[(n*i)+tx];
        for (int j = 0; j < CHKSIZE; j++){ // compute all CHKSIZE B vectors with read of A
          float temp = Atemp - my_sB[i + (j*SIZE)];
          result[j] += temp * temp;}}
      for (int i = 0; i < CHKSIZE; i++) // store CHKSIZE results
        C[((i+(bx*CHKSIZE))*n)+ tx] = result[i];
      tx += blockDim.x;  } // continue looping across vectors in A
    __syncthreads(); // necessary to prevent warps from racing ahead, if block looping is used
    bx += gridDim.x;}
}

float comp_euclid_sq(const float *rA, const float *rB, const int size){

  float result = 0.0f;
  float temp;
  for (int i = 0; i < size; i++){
    temp = (rA[i] - rB[i]);
    result += temp * temp;}
  return result;
}

int main()
{
     float et1=0.0f, et2=0.0f, et3=0.0f, et4=0.0f;
     cudaEvent_t start1, start2, start3,start4, stop1, stop2, stop3, stop4;
     cudaEventCreate(&start1);
     cudaEventCreate(&start2);
     cudaEventCreate(&start3);
     cudaEventCreate(&start4);
     cudaEventCreate(&stop1);
     cudaEventCreate(&stop2);
     cudaEventCreate(&stop3);
     cudaEventCreate(&stop4);

     int n = N;  //MatrixA size : n * SIZE
     int m = M; //MatrixB size : m * SIZE

     srand((unsigned)time(0));

     // Host Allocations
     float *matrixA = (float *) malloc (n * SIZE * sizeof(float));
     for(int i=0; i < n * SIZE; i++)
         matrixA[i] = (float) (rand()%100)+1;

     float *matrixB = (float *) malloc (m * SIZE * sizeof(float));
     for(int i=0; i < m * SIZE; i++)
         matrixB[i] = (float) (rand()%100)+1;

     float *results_kernel = (float *) malloc (n * m * sizeof(float));
     float *cpu_results_kernel = (float *) malloc (n * m * sizeof(float));
     for (int i = 0; i< n*m; i++)
       cpu_results_kernel[i] = comp_euclid_sq(matrixA + ((i/m)*SIZE), matrixB + (i%m)*SIZE, SIZE);

     //Device Allocation
     float *d_matrixA;
     float *d_matrixB;
     cudaMalloc((void **)&d_matrixA, n * SIZE * sizeof(float));
     cudaMalloc((void **)&d_matrixB, m * SIZE * sizeof(float));
     cudaMemcpy(d_matrixA , matrixA , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);
     cudaMemcpy(d_matrixB , matrixB , m * SIZE * sizeof(float) , cudaMemcpyHostToDevice);

     float *d_results_kernel;
     cudaMalloc((void **)&d_results_kernel , n * m * sizeof(float));


     dim3 threads1 (1 , SIZE);
     dim3 blocks1  (n , m);
     cudaEventRecord(start1);
     EuclideanDistances1 <<<blocks1 , threads1>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
     cudaEventRecord(stop1);
     cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
     for (int i = 0; i< n*m; i++) {
       if (results_kernel[i] != cpu_results_kernel[i])  {printf("cpu/kernel1 mismatch at %d, cpu: %f, kernel1: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
     cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
     cudaEventSynchronize(stop1);
     cudaEventElapsedTime(&et1, start1, stop1);

     dim3 threads2 (8 , SIZE);   // 1024 threads per block (maximum)
     dim3 blocks2  (n/8 , m); // assumes n evenly divisible by 8
     cudaEventRecord(start2);
     EuclideanDistances2 <<<blocks2 , threads2>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
     cudaEventRecord(stop2);
     cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
     for (int i = 0; i< n*m; i++) {
       if (results_kernel[i] != cpu_results_kernel[i])  {printf("cpu/kernel2 mismatch at %d, cpu: %f, kernel1: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
     cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
     cudaEventSynchronize(stop2);
     cudaEventElapsedTime(&et2, start2, stop2);

     cudaFuncSetCacheConfig(EuclideanDistances3, cudaFuncCachePreferL1);
     dim3 threads3 (8, 32);   // 1024 threads per block (maximum)
     dim3 blocks3  (n/threads3.x , m/threads3.y); // assumes evenly divisible
     cudaEventRecord(start3);
     EuclideanDistances3 <<<blocks3 , threads3>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
     cudaEventRecord(stop3);
     cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
     for (int i = 0; i< n*m; i++) {
       if (results_kernel[i] != cpu_results_kernel[i])  {printf("cpu/kernel3 mismatch at %d, cpu: %f, kernel3: %f\n", i, cpu_results_kernel[i], results_kernel[i]); return 1;}}
     cudaMemset(d_results_kernel, 0, n*m*sizeof(float));
     cudaEventSynchronize(stop3);
     cudaEventElapsedTime(&et3, start3, stop3);

     // transpose matrix A
     float *matrixA_T = (float *) malloc (n * SIZE * sizeof(float));
       for (int i = 0; i < n; i++)
         for (int j = 0; j < SIZE; j++)
           matrixA_T[(j*n)+i] = matrixA[(i*SIZE)+j];
     cudaMemcpy(d_matrixA , matrixA_T , n * SIZE * sizeof(float) , cudaMemcpyHostToDevice);

     cudaFuncSetCacheConfig(EuclideanDistances4, cudaFuncCachePreferL1);
     dim3 threads4(SIZE); // one thread per vector element
     dim3 blocks4(m/CHKSIZE);
     cudaEventRecord(start4);
     EuclideanDistances4 <<<blocks4 , threads4>>> (d_matrixA , d_matrixB , d_results_kernel , n , m);
     cudaEventRecord(stop4);
     cudaMemcpy(results_kernel , d_results_kernel , n * m *sizeof(float) , cudaMemcpyDeviceToHost);
     // test for correct transposed result C(m,n)
     for (int i = 0; i< n; i++)
       for (int j = 0; j < m; j++)
         if (results_kernel[(j*n)+i] != cpu_results_kernel[(i*m)+j])  {printf("cpu/kernel4 mismatch at %d,%d, cpu: %f, kernel4: %f\n", i,j, cpu_results_kernel[(i*m)+j], results_kernel[(j*n)+i]); return 1;}
     cudaEventSynchronize(stop4);
     cudaEventElapsedTime(&et4, start4, stop4);
     cudaFree(d_results_kernel);

     printf("Success!\n");
     printf("kernel1 : %.fms, kernel2 : %.fms, kernel3 : %.fms, kernel4 : %.fms\n", et1, et2, et3, et4);

     free(matrixA);
     free(matrixB);
     free(results_kernel);

     return 0;
}

$ nvcc -O3 -arch=sm_20 -o t460 t460.cu
$ ./t460
Success!
kernel1 : 2213ms, kernel2 : 4660ms, kernel3 : 691ms, kernel4 : 99ms
$

Hopefully that will get you going with more ideas of things to work on. You may get different timings of course on your cc3.0 device.

Are further optimizations possible? Probably. The first target I would look at would be to figure out how to take advantage of the data-reuse opportunities on vector A. (data re-use of vector B is already handled in the kernel 4 by loading it into shared memory. There may be ways to use some shared memory to store portions of A to make the code run even faster.)

I guess I should also mention that following the lead of the code you provided, this code is computing the square of the euclidean distance. A trivial modification to the kernels can make it compute the actual euclidean distance instead (C[...] = sqrtf(...);) The validation I have included, however, assumes the results are "in-range" for perfect storage of an integer quantity in a float. Your test case satisfies this requirement, but otherwise the validation code would need to be modified (if sqrtf were used).

这篇关于在CUDA中增加每个线程的工作的示例的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆