Low performance in CUDA prime number generator

Question

I am writing my first program in CUDA. It is a prime number generator. It works, but it is only 50% faster than the equivalent single-threaded C++ code. The CPU version uses 100% of one core. The GPU version uses only 20% of the GPU. The CPU is an i5 (2310). The GPU is a GF104.

How can I improve the performance of this algorithm?

My complete program:

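#include <iostream>
#include <vector>
#include <cuda_runtime.h>

using namespace std;

int* d_C;  // device buffer holding one batch of results

// Each thread tests one candidate. Candidates divisible by 2, 3, 5 or 7
// are marked with 0; all others are stored unchanged.
__global__ void primo(int* C, int N, int multi)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < N)
  {
    // Test the full candidate value, not just the index within the batch
    int n = i + N*multi;
    if(n%2==0||n%3==0||n%5==0||n%7==0)
    {
      C[i]=0;
    }
    else
    {
      C[i]=n;
    }
  }
}

int main()
{
  cout<<"Prime numbers \n";
  int N=1000;
  int h_C[1000];
  size_t size=N* sizeof(int);
  cudaMalloc((void**)&d_C, size);

  int threadsPerBlock = 1024;
  int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
  vector<int> lista(100000000);
  int c_z=0;  // number of rejected candidates so far

  for(int i=0;i<100000;i++)
  {
    primo<<<blocksPerGrid, threadsPerBlock>>>(d_C, N,i);
    // Blocking copy: it also waits for the kernel to finish
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    for(int c=0;c<N;c++)
    {
      if(h_C[c]!=0)
      {
        lista[c+N*i-c_z]=h_C[c];  // compact the survivors to the front
      }
      else
      {
        c_z++;
      }
    }
  }
  lista.resize(lista.size()-c_z);  // keep only the survivors
  cudaFree(d_C);
  return(0);
}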



I tried using a 2D array and a for loop in the kernel, but was unable to get the correct results.

Recommended answer

Welcome to Stack Overflow.

Here are some potential issues:


  • N = 1000 is too low. Since you have 1024 threadsPerBlock, each kernel launch runs only one block, which is not enough to keep the GPU busy. Try N = 1000000, so that each launch covers nearly 1000 blocks.
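To make the block-count arithmetic concrete, here is a minimal sketch of the ceiling division the question already uses. The helper name `blocksFor` is hypothetical and is not part of the original program.

```cuda
// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division, same formula as blocksPerGrid in the question).
inline int blocksFor(int n, int threadsPerBlock)
{
  return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// blocksFor(1000, 1024)    == 1    one block per launch
// blocksFor(1000000, 1024) == 977  many blocks to spread across the multiprocessors
```

With N = 1000000 the outer loop in the question would need only 100 iterations to cover the same range. The host buffer h_C would then probably have to move to the heap (for example a std::vector<int>), since a million-element int array may be too large for the stack on some platforms.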

You're doing very little work on the GPU (4 modulus operations per number tested). So it's probably faster to do those operations on the CPU than it is to copy the results back from the GPU (over the PCIe bus).
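One way to check this on your own hardware is to time the kernel and the device-to-host copy separately with CUDA events. The sketch below is an illustration only: `filterKernel` is a hypothetical stand-in for the question's kernel, `timeKernelAndCopy` is a made-up helper name, and the block size of 256 is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the question's kernel: four modulus tests per element.
__global__ void filterKernel(int* out, int n, int base)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
  {
    int v = base + i;
    out[i] = (v % 2 == 0 || v % 3 == 0 || v % 5 == 0 || v % 7 == 0) ? 0 : v;
  }
}

// Measures kernel time and device-to-host copy time separately (milliseconds).
// Returns false if the allocation or the copy reports a CUDA error.
bool timeKernelAndCopy(int n, float* kernelMs, float* copyMs)
{
  std::vector<int> h_out(n);
  int* d_out = nullptr;
  size_t bytes = (size_t)n * sizeof(int);
  if (cudaMalloc((void**)&d_out, bytes) != cudaSuccess) return false;

  cudaEvent_t start, afterKernel, afterCopy;
  cudaEventCreate(&start);
  cudaEventCreate(&afterKernel);
  cudaEventCreate(&afterCopy);

  int threads = 256;  // assumed block size
  int blocks = (n + threads - 1) / threads;

  cudaEventRecord(start);
  filterKernel<<<blocks, threads>>>(d_out, n, 0);
  cudaEventRecord(afterKernel);
  cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaEventRecord(afterCopy);
  cudaEventSynchronize(afterCopy);  // wait until all three events have completed

  if (err == cudaSuccess)
  {
    cudaEventElapsedTime(kernelMs, start, afterKernel);
    cudaEventElapsedTime(copyMs, afterKernel, afterCopy);
  }

  cudaEventDestroy(start);
  cudaEventDestroy(afterKernel);
  cudaEventDestroy(afterCopy);
  cudaFree(d_out);
  return err == cudaSuccess;
}
```

Calling timeKernelAndCopy(1000000, &k, &c) and printing k and c shows how the two costs compare on a given machine. If the copy takes longer than the kernel, the transfer rather than the arithmetic is the limiting step.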

To make it worthwhile to use the GPU for finding the prime numbers, I think you need to implement the entire algorithm on the GPU, instead of just the modulus operations.
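As one possible reading of this advice, the sketch below runs a complete primality test (trial division) on the device and keeps the tally there too, so only a single counter crosses the PCIe bus. This is not the answerer's code. The names `isPrime`, `countPrimes` and `countPrimesBelow` are hypothetical, the launch size (1024 blocks of 256 threads) is an arbitrary assumption, and the limit is assumed to stay far below the range of unsigned int.

```cuda
#include <cuda_runtime.h>

// Full primality test by trial division; runs on the device.
__device__ bool isPrime(unsigned int n)
{
  if (n < 2) return false;
  if (n < 4) return true;  // 2 and 3
  if (n % 2 == 0 || n % 3 == 0) return false;
  // Remaining divisor candidates have the form 6k-1 or 6k+1.
  for (unsigned int d = 5; d <= n / d; d += 6)
  {
    if (n % d == 0 || n % (d + 2) == 0) return false;
  }
  return true;
}

// Grid-stride loop: each thread tests many candidates and adds its
// private tally to the global counter once at the end.
__global__ void countPrimes(unsigned int limit, unsigned int* count)
{
  unsigned int stride = blockDim.x * gridDim.x;
  unsigned int local = 0;
  for (unsigned int n = blockIdx.x * blockDim.x + threadIdx.x; n < limit; n += stride)
  {
    if (isPrime(n)) local++;
  }
  if (local > 0) atomicAdd(count, local);
}

// Counts the primes below limit. Returns -1 on a CUDA error.
long long countPrimesBelow(unsigned int limit)
{
  unsigned int* d_count = nullptr;
  if (cudaMalloc((void**)&d_count, sizeof(unsigned int)) != cudaSuccess) return -1;
  cudaMemset(d_count, 0, sizeof(unsigned int));

  countPrimes<<<1024, 256>>>(limit, d_count);  // assumed launch size

  // Only four bytes are copied back, instead of one int per candidate.
  unsigned int h_count = 0;
  cudaError_t err = cudaMemcpy(&h_count, d_count, sizeof(unsigned int),
                               cudaMemcpyDeviceToHost);
  cudaFree(d_count);
  return err == cudaSuccess ? (long long)h_count : -1;
}
```

For example, countPrimesBelow(100) should return 25. Producing the list of primes rather than a count would additionally need a compaction step on the device, so that only the surviving values are copied back.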
