简单的CUDA内核优化 [英] Simple CUDA Kernel Optimization

查看:151
本文介绍了简单的CUDA内核优化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在加速应用程序的过程中,我有一个非常简单的内核,其类型转换如下所示:

In the process of speeding up an application, I have a very simple kernel which does the type casting as shown below:

__global__ void UChar2FloatKernel(float *out, unsigned char *in, int nElem){
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i<nElem)
        out[i] = (float) in[i];
}

全局内存访问被合并,在我的理解中使用共享内存也不会是有益的,因为没有相同存储器的多次读取。任何人都有任何想法,如果有任何可以执行的优化,以加快这个内核。输入和输出数据已经在设备上,因此不需要主机到设备内存的复制。

The global memory access is coalesced and in my understanding using shared memory will also not be beneficial as there are not multiple reads of the same memory. Does any one have any idea if there is any optimization which can be performed to speed up this kernel. The input and output data is already on the device, so no host to device memory copy will be required.

推荐答案

优化你可以对一个代码执行,就是使用常驻线程,并增加每个线程执行的事务数。虽然CUDA块调度模型非常轻量级,但是它不是免费的,并且启动很多包含只有单个内存负载和单个内存存储的线程的块将会产生大量的块调度开销。所以只启动尽可能多的块,填充你的GPU的所有SM,并让每个线程做更多的工作。

The single biggest optimisation you can perform on a code like that one is to use resident threads and increase the number of transactions each thread performs. While the CUDA block scheduling model is pretty lightweight, it isn't free, and launching a lot blocks containing threads which do only a single memory load and single memory store will accrue a lot of block scheduling overhead. So only launch as many blocks as will "fill" the all the SM of your GPU and have each thread do more work.

第二个明显的优化是切换到128字节负载的内存事务,这应该给你一个有形的带宽利用率增益。在费米或Kepler GPU上,这将不会像第一代和第二代硬件那样提供更大的性能提升。

The second obvious optimization is switch to 128 byte memory transactions for loads, which should give you a tangible bandwidth utilization gain. On a Fermi or Kepler GPU this won't give as large a performance boost as on first and second generation hardware.

将这些完全整合到一个简单的基准中:

Putting this altogether into a simple benchmark:

__global__ 
void UChar2FloatKernel(float *out, unsigned char *in, int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if(i<nElem)
        out[i] = (float) in[i];
}

__global__
void UChar2FloatKernel2(float  *out, 
                const unsigned char *in, 
            int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;    
    for(; i<nElem; i+=gridDim.x*blockDim.x) {
        out[i] = (float) in[i];
    }
}

__global__
void UChar2FloatKernel3(float4  *out, 
                const uchar4 *in, 
            int nElem)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;    
    for(; i<nElem; i+=gridDim.x*blockDim.x) {
        uchar4 ival = in[i]; // 32 bit load
        float4 oval = make_float4(ival.x, ival.y, ival.z, ival.w);
        out[i] = oval; // 128 bit store
    }
}

int main(void)
{

    const int n = 2 << 20;
    unsigned char *a = new unsigned char[n];

    for(int i=0; i<n; i++) {
        a[i] = i%255;
    }

    unsigned char *a_;
    cudaMalloc((void **)&a_, sizeof(unsigned char) * size_t(n));
    float *b_;
    cudaMalloc((void **)&b_, sizeof(float) * size_t(n));
    cudaMemset(b_, 0, sizeof(float) * size_t(n)); // warmup

    for(int i=0; i<5; i++)
    {
        dim3 blocksize(512);
        dim3 griddize(n/512);
        UChar2FloatKernel<<<griddize, blocksize>>>(b_, a_, n);
    }

    for(int i=0; i<5; i++)
    {
        dim3 blocksize(512);
        dim3 griddize(8); // 4 blocks per SM
        UChar2FloatKernel2<<<griddize, blocksize>>>(b_, a_, n);
    }

    for(int i=0; i<5; i++)
    {
        dim3 blocksize(512);
        dim3 griddize(8); // 4 blocks per SM
        UChar2FloatKernel3<<<griddize, blocksize>>>((float4*)b_, (uchar4*)a_, n/4);
    }
    cudaDeviceReset();
    return 0;
}  

在小型Fermi设备上显示:

gives me this on a small Fermi device:

>nvcc -m32 -Xptxas="-v" -arch=sm_21 cast.cu
cast.cu
tmpxft_000014c4_00000000-5_cast.cudafe1.gpu
tmpxft_000014c4_00000000-10_cast.cudafe2.gpu
cast.cu
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z18UChar2FloatKernel2PfPKhi' for 'sm_2
1'
ptxas : info : Function properties for _Z18UChar2FloatKernel2PfPKhi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 5 registers, 44 bytes cmem[0]
ptxas : info : Compiling entry function '_Z18UChar2FloatKernel3P6float4PK6uchar4
i' for 'sm_21'
ptxas : info : Function properties for _Z18UChar2FloatKernel3P6float4PK6uchar4i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 8 registers, 44 bytes cmem[0]
ptxas : info : Compiling entry function '_Z17UChar2FloatKernelPfPhi' for 'sm_21'

ptxas : info : Function properties for _Z17UChar2FloatKernelPfPhi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 3 registers, 44 bytes cmem[0]
tmpxft_000014c4_00000000-5_cast.cudafe1.cpp
tmpxft_000014c4_00000000-15_cast.ii

>nvprof a.exe
======== NVPROF is profiling a.exe...
======== Command: a.exe
======== Profiling result:
 Time(%)      Time   Calls       Avg       Min       Max  Name
   40.20    6.61ms       5    1.32ms    1.32ms    1.32ms  UChar2FloatKernel(float*, unsigned char*, int)
   29.43    4.84ms       5  968.32us  966.53us  969.46us  UChar2FloatKernel2(float*, unsigned char const *, int)
   26.35    4.33ms       5  867.00us  866.26us  868.10us  UChar2FloatKernel3(float4*, uchar4 const *, int)
    4.02  661.34us       1  661.34us  661.34us  661.34us  [CUDA memset]

在后两个内核中,与4096个块相比,仅使用8个块可以大大提高速度,这证实了每个线程多个工作项在这种内存限制,低指令数内核中提高性能的最好方法。

In the latter two kernel, using only 8 blocks gives a large speed up compared to 4096 blocks, which confirms the idea that multiple work items per thread is the best way to improve performance in this sort of memory bound, low instruction count kernel.

这篇关于简单的CUDA内核优化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆