Why doesn't CUDA result in a speedup in C++ code?


Problem description


I'm using VS2019 and have an NVIDIA GeForce GPU. I tried the code from this link: https://towardsdatascience.com/writing-lightning-fast-code-with-cuda-c18677dcdd5f


The author of that post claims to get a speedup when using CUDA. However, for me, the serial version takes around 7 milliseconds while the CUDA version takes around 28 milliseconds. Why is CUDA slower for this code? The code I used is below:

#include <iostream>
#include <chrono>
#include <cmath>

// Grid-stride loop: each thread starts at its global index and walks
// the array in steps of the total thread count of the grid.
__global__
void add(int n, float* x, float* y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

void addSerial(int n, float* x, float* y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main()
{
    int NSerial = 1 << 20;   
    float* xSerial = new float[NSerial];
    float* ySerial = new float[NSerial];
    for (int i = 0; i < NSerial; i++) {
        xSerial[i] = 1.0f;
        ySerial[i] = 2.0f;
    }
    auto t1Serial = std::chrono::high_resolution_clock::now();
    addSerial(NSerial, xSerial, ySerial);
    auto t2Serial = std::chrono::high_resolution_clock::now(); 
    auto durationSerial = std::chrono::duration_cast<std::chrono::milliseconds>(t2Serial - t1Serial).count(); 
    float maxErrorSerial = 0.0f;
    for (int i = 0; i < NSerial; i++)
        maxErrorSerial = fmax(maxErrorSerial, fabs(ySerial[i] - 3.0f));
    std::cout << "Max error Serial: " << maxErrorSerial << std::endl;
    std::cout << "durationSerial: "<<durationSerial << std::endl;
    delete[] xSerial;
    delete[] ySerial;


    int N = 1 << 20;   

    float* x, * y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }


    int device = -1;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
    cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);


    int blockSize = 1024;
    int numBlocks = (N + blockSize - 1) / blockSize;
    auto t1 = std::chrono::high_resolution_clock::now();
    add<<<numBlocks, blockSize>>>(N, x, y);

    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now(); 
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count(); 

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    std::cout << "duration CUDA: "<<duration; 

    cudaFree(x);
    cudaFree(y);



    return 0;
}

Recommended answer

A few observations here:

  1. The first call of a CUDA kernel can accumulate a lot of one-time latency associated with setup on the GPU, so the usual approach is to include a "warm-up" call (a minimal sketch follows this list).
  2. The kernel in your question is a "resident" design, so it should perform best when you launch only as many blocks as are needed to fully occupy the GPU. There is an API you can use to get this information for your device.
  3. Perform the timing in microseconds, not milliseconds.
  4. Build the code in release mode.
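
For point 1, a minimal sketch of what a warm-up call looks like (illustrative only, not part of the answer's code below, where the first of the ten timed repetitions absorbs this cost instead):

    // Warm-up launch: pays the one-time GPU setup/context cost up front
    // so it is not attributed to the kernel in the timed runs.
    // Note: this launch still adds x into y once, shifting the expected
    // verification value by 1.
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();   // wait for setup and the launch to finish
    // ... start the std::chrono timer only after this point ...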


Doing all of this to your CUDA code gets me this:

    int N = 1 << 20;   
    int device = -1;
    cudaGetDevice(&device);

    float* x, * y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
    cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);

    int blockSize, numBlocks;
    // Ask the runtime for the block size that maximizes occupancy and the
    // minimum grid size needed to keep the device fully occupied ("resident").
    cudaOccupancyMaxPotentialBlockSize(&numBlocks, &blockSize, add);

    for(int rep=0; rep<10; rep++) {
        auto t1 = std::chrono::high_resolution_clock::now();
        add<<<numBlocks, blockSize>>>(N, x, y);
        cudaDeviceSynchronize();
        auto t2 = std::chrono::high_resolution_clock::now(); 
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(); 
        std::cout << rep << " duration CUDA: " << duration <<std::endl; 
    }

    float maxError = 0.0f;
    // After 10 launches, each y[i] = 2 + 10 * 1 = 12.
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 12.0f));
    std::cout << "Max error: " << maxError << std::endl;

    cudaFree(x);
    cudaFree(y);
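
As a side note, the host-side std::chrono timing above is valid here because cudaDeviceSynchronize waits for the kernel to finish; CUDA events are a common alternative that measure the kernel on the GPU itself. A minimal sketch, not part of the original answer:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                    // enqueue start marker
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaEventRecord(stop);                     // enqueue stop marker
    cudaEventSynchronize(stop);                // wait until stop is reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    std::cout << "kernel: " << ms * 1000.0f << " us" << std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);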

Building and running it:

$ nvcc -arch=sm_52 -std=c++11 -o not_so_fast not_so_fast.cu 
$ ./not_so_fast 
Max error Serial: 0
durationSerial: 2762
0 duration CUDA: 1074
1 duration CUDA: 150
2 duration CUDA: 151
3 duration CUDA: 158
4 duration CUDA: 152
5 duration CUDA: 152
6 duration CUDA: 147
7 duration CUDA: 124
8 duration CUDA: 112
9 duration CUDA: 113
Max error: 0


On my system, the first GPU run is close to three times as fast as the serial loop. The second and subsequent runs are almost 10 times faster again. Your results can (and probably will) vary.
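
These numbers are consistent with the kernel being memory-bandwidth bound: each element involves two 4-byte reads (x[i], y[i]) and one 4-byte write (y[i]), with almost no arithmetic to hide the memory traffic. A back-of-the-envelope check of the effective bandwidth implied by the fastest run above (the 112 µs figure is taken from the output; the peak bandwidth to compare it against depends on your GPU):

    // Effective bandwidth implied by the fastest timed run above.
    const double n     = double(1 << 20);          // elements
    const double bytes = 3.0 * n * sizeof(float);  // 2 reads + 1 write each
    const double secs  = 112e-6;                   // ~112 us (run 8 above)
    std::cout << (bytes / secs) / 1e9 << " GB/s" << std::endl;  // ~112 GB/s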
