Simplest Possible Example to Show GPU Outperform CPU Using CUDA


Problem Description



I am looking for the most concise code possible that can be written both for a CPU (using g++) and a GPU (using nvcc), for which the GPU consistently outperforms the CPU. Any type of algorithm is acceptable.

To clarify: I'm literally looking for two short blocks of code, one for the CPU (using C++ in g++) and one for the GPU (using C++ in nvcc), for which the GPU outperforms the CPU. Preferably on the scale of seconds or milliseconds. The shortest code pair possible.

Solution

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU on a nanosecond job (or even a millisecond or second job) completely misses the point of using a GPU. Below is some simple code, but to really appreciate the performance benefits of the GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two-foot race, simply because it takes some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.

Use something like this in C++:

  #include <cstdio>

  #define N (1024*1024)
  #define M (1000000)

  int main()
  {
     static float data[N];             // static keeps the 4 MB array off the stack
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        // M dependent iterations per element: this is the work the GPU will parallelize
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     // Read an index and print one result so the compiler can't optimize the loop away
     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }
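
If you want to see actual numbers rather than just trust the comparison, you can wrap the compute loop in a wall-clock timer. Here is a minimal sketch of that, assuming the same N and M as above; the cpu_compute helper name is mine, not part of the original answer.

  // Timing sketch for the CPU side (assumed helper name cpu_compute).
  #include <chrono>
  #include <cstdio>
  #include <vector>

  #define N (1024*1024)
  #define M (1000000)          /* shrink M for a quick sanity check; the full size is deliberately large */

  static void cpu_compute(std::vector<float> &data)
  {
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
           data[i] = data[i] * data[i] - 0.25f;
     }
  }

  int main()
  {
     std::vector<float> data(N);        // heap allocation instead of a large stack array
     auto t0 = std::chrono::steady_clock::now();
     cpu_compute(data);
     auto t1 = std::chrono::steady_clock::now();
     double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
     printf("CPU time: %.1f ms (data[0] = %f)\n", ms, data[0]);
  }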

Use something like this in CUDA/C:

  #include <cstdio>

  #define N (1024*1024)
  #define M (1000000)

  // One thread per element; N is an exact multiple of the block size (256)
  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     static float data[N];             // static keeps the 4 MB array off the stack
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     cudakernel<<<N/256, 256>>>(d_data);
     // cudaMemcpy on the default stream waits for the kernel to finish before copying
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data);

     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }
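
The GPU side can be timed with CUDA events, which also makes it easy to see how much of the total is launch and transfer overhead versus kernel time. Again, this is a minimal sketch of my own, not part of the original answer; it measures the kernel launch plus the copy back together.

  // Timing sketch for the GPU side using cudaEvent_t (my addition).
  #include <cstdio>

  #define N (1024*1024)
  #define M (1000000)

  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     static float data[N];             // static keeps the 4 MB buffer off the stack
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));

     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);

     cudaEventRecord(start);
     cudakernel<<<N/256, 256>>>(d_data);
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaEventRecord(stop);
     cudaEventSynchronize(stop);        // wait until the stop event has actually happened

     float ms = 0.0f;
     cudaEventElapsedTime(&ms, start, stop);
     printf("GPU time (kernel + copy back): %.1f ms (data[0] = %f)\n", ms, data[0]);

     cudaEventDestroy(start);
     cudaEventDestroy(stop);
     cudaFree(d_data);
  }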

If the GPU still doesn't come out ahead, try making N and M bigger, or changing 256 to 128 or 512.
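
One caveat with that tuning (my note, not from the original answer): N/256 only launches enough threads because N is an exact multiple of the block size. If you change the sizes so that stops being true, round the grid up and guard the index in the kernel, roughly like this:

  // Sketch of a launch that tolerates N not being a multiple of the block size.
  #include <cstdio>

  #define N (1500000)                   /* deliberately not a multiple of 256 */
  #define M (1000)                      /* smaller M just so this sketch finishes quickly */

  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     if (i >= N) return;                // threads in the last, partial block do nothing
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     int threads = 256;
     int blocks  = (N + threads - 1) / threads;   // ceiling division covers the remainder
     cudakernel<<<blocks, threads>>>(d_data);
     cudaDeviceSynchronize();
     printf("launched %d blocks of %d threads for N = %d\n", blocks, threads, N);
     cudaFree(d_data);
  }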
