Simplest Possible Example to Show GPU Outperform CPU Using CUDA


Problem description

I am looking for the shortest possible code that can be written both for a CPU (using g++) and for a GPU (using nvcc), and for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.

To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc), for which the GPU outperforms the CPU. Preferably the run times are on the scale of seconds or milliseconds. The shortest code pair possible.

Recommended answer

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of a GPU, you'll need a big problem size to amortize the startup costs over; otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.

Use something like this in C++:

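  #include <stdio.h>

  #define N (1024*1024)
  #define M (1000000)

  static float data[N];   // 4 MB: static storage avoids overflowing the stack

  int main()
  {
     // Iterate x = x*x - 0.25 M times for each of the N elements.
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     int sel;
     printf("Enter an index: ");
     if (scanf("%d", &sel) != 1 || sel < 0 || sel >= N)
        return 1;   // reject non-numeric or out-of-range input
     printf("data[%d] = %f\n", sel, data[sel]);
     return 0;
  }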
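
The question asks for run times in seconds or milliseconds, and the CPU version above performs N × M = 1,048,576 × 1,000,000, roughly 10^12, inner-loop iterations at full size. One way to get a feel for the CPU cost before committing to that is to time the same recurrence at smaller sizes. The following is only a sketch: `iterate_map` and `time_iterate_map_ms` are hypothetical helper names that do not appear in the original answer, and `std::vector` is used instead of a fixed array so the sizes can vary.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original answer): applies
// x = x*x - 0.25f to each of n elements m times, starting from x = i/n,
// which is the same recurrence as the loops above.
std::vector<float> iterate_map(std::size_t n, int m)
{
    assert(n > 0 && m >= 0);
    std::vector<float> data(n);
    for (std::size_t i = 0; i < n; i++) {
        data[i] = 1.0f * i / n;
        for (int j = 0; j < m; j++)
            data[i] = data[i] * data[i] - 0.25f;
    }
    return data;
}

// Hypothetical helper: wall-clock milliseconds taken by iterate_map(n, m).
double time_iterate_map_ms(std::size_t n, int m)
{
    auto start = std::chrono::steady_clock::now();
    // Reading one element through a volatile keeps the result observable,
    // so the compiler cannot discard the whole computation.
    volatile float sink = iterate_map(n, m)[0];
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling, say, `time_iterate_map_ms(1024, 100000)` and scaling the result by the ratio of total iterations gives a rough estimate of the full-size CPU run time. It is only an estimate, since caching and compiler optimization levels affect the per-iteration cost.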

Use something like this in CUDA/C:

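  #include <stdio.h>
  #include <cuda_runtime.h>

  #define N (1024*1024)
  #define M (1000000)

  // Each thread iterates x = x*x - 0.25 M times on its own element.
  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     if (i >= N) return;   // guard in case the grid covers more than N threads
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  static float data[N];   // 4 MB: static storage avoids overflowing the stack

  int main()
  {
     float *d_data;
     if (cudaMalloc(&d_data, N * sizeof(float)) != cudaSuccess)
        return 1;
     cudakernel<<<N/256, 256>>>(d_data);
     // cudaMemcpy waits for the kernel to finish and reports its errors.
     cudaError_t err = cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data);
     if (err != cudaSuccess)
     {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
     }

     int sel;
     printf("Enter an index: ");
     if (scanf("%d", &sel) != 1 || sel < 0 || sel >= N)
        return 1;   // reject non-numeric or out-of-range input
     printf("data[%d] = %f\n", sel, data[sel]);
     return 0;
  }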

If the GPU does not come out ahead, try making N and M bigger, or changing 256 to 128 or 512.
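
One caveat when changing these values: the launch uses `N/256` blocks, and integer division truncates. With N = 1024*1024 the block sizes 128, 256 and 512 all divide evenly, but if N is changed to a value that is not a multiple of the block size, the last few elements would never be computed. A common way around this is to round the block count up. The helper below is a host-side sketch; `blocks_for` is a hypothetical name that does not appear in the original answer.

```cpp
#include <cassert>

// Hypothetical helper (not from the original answer): number of blocks
// needed to cover n elements with threads_per_block threads per block.
// Uses ceiling division so a partial last block is still launched.
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}
```

The launch would then read `cudakernel<<<blocks_for(N, 256), 256>>>(d_data);`. Because rounding up can start more threads than there are elements, any kernel launched this way must check `i < N` before touching `buf[i]`.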
