count3's in CUDA is very slow


Question

I have written a small program in CUDA that counts how many 3's are in a C array and prints the count.

#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#include <cstdlib>

__global__ void incrementArrayOnDevice(int *a, int N, int *count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    //__shared__ int s_a[512]; // one for each thread
    //s_a[threadIdx.x] = a[id];

    if( id < N )
    {
        //if( s_a[threadIdx.x] == 3 )
        if( a[id] == 3 )
        {
            atomicAdd(count, 1);
        }
    }
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory

    int N = 16777216;

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    // copy data from host to device
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);

    // do calculation on device
    int blockSize = 512;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n", count);

    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}

The result I get is:

real 0m3.025s
user 0m2.989s
sys  0m0.029s

When I run it on the CPU with 4 threads I get:

real 0m0.101s
user 0m0.100s
sys  0m0.024s

Note that the GPU is an old one - I don't know the exact model because I do not have root access to it, but the OpenGL version it reports is 1.2, using the Mesa driver.

Am I doing something wrong? What can I do to make it run faster?

Note: I have tried using buckets for each block (so the atomicAdd()s would be reduced to one counter per block), but I get exactly the same performance. I have also tried copying the 512 integers assigned to each block into a shared memory block (you can see it in the comments) and the time is the same again.
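For reference, the "bucket per block" variant described here might look like the following sketch (hypothetical code, not taken from the question; `countThreesPerBlock` and `blockCounts` are illustrative names). Each block accumulates into its own slot, so atomics from different blocks never contend on the same address, and the host sums the per-block slots afterwards:

```cuda
// Hypothetical sketch of the per-block bucket idea: each block owns one
// slot of blockCounts, so atomicAdd contention is confined to one block.
__global__ void countThreesPerBlock(const int *a, int n, int *blockCounts)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n && a[id] == 3)
        atomicAdd(&blockCounts[blockIdx.x], 1);
}

// Host side (after cudaMalloc/cudaMemset of nBlocks ints and the launch,
// and copying the slots back into h_blockCounts):
//   int total = 0;
//   for (int b = 0; b < nBlocks; ++b) total += h_blockCounts[b];
```

This still serializes all threads of a block on a single counter, which is consistent with the observation that it performed no better.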

Answer

This is in response to your question "What can I do to make it run faster?" As I mentioned in the comments, there are (probably) issues with the timing methodology, and my main suggestion for a speed improvement is to use a classical parallel reduction algorithm. The following code implements a better (in my opinion) timing measurement, and also converts your kernel to a reduction-style kernel:

#include <stdio.h>
#include <assert.h>
#include <cstdlib>


#define N (1<<24)
#define nTPB 512
#define NBLOCKS 32

__global__ void incrementArrayOnDevice(int *a, int n, int *count)
{
  __shared__ int lcnt[nTPB];
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  int lcount = 0;
  while (id < n) {
    if (a[id] == 3) lcount++;
    id += gridDim.x * blockDim.x;
    }
  lcnt[threadIdx.x] = lcount;
  __syncthreads();
  int stride = blockDim.x;
  while(stride > 1) {
    // assume blockDim.x is a power of 2
    stride >>= 1;
    if (threadIdx.x < stride) lcnt[threadIdx.x] += lcnt[threadIdx.x + stride];
    __syncthreads();
    }
  if (threadIdx.x == 0) atomicAdd(count, lcnt[0]);
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory
    cudaEvent_t gstart1,gstart2,gstop1,gstop2,cstart,cstop;
    float etg1, etg2, etc;

    cudaEventCreate(&gstart1);
    cudaEventCreate(&gstart2);
    cudaEventCreate(&gstop1);
    cudaEventCreate(&gstop2);
    cudaEventCreate(&cstart);
    cudaEventCreate(&cstop);

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    int blockSize = nTPB;
    int nBlocks = NBLOCKS;
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    // copy data from host to device
    cudaEventRecord(gstart1);
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemset(devCount, 0, sizeof(int));
    cudaEventRecord(gstart2);
    // do calculation on device

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);
    cudaEventRecord(gstop2);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(gstop1);

    printf("GPU count = %d\n", count);
    int hostCount = 0;
    cudaEventRecord(cstart);
    for (int i=0; i < N; i++)
      if (a_h[i] == 3) hostCount++;
    cudaEventRecord(cstop);

    printf("CPU count = %d\n", hostCount);
    cudaEventSynchronize(cstop);
    cudaEventElapsedTime(&etg1, gstart1, gstop1);
    cudaEventElapsedTime(&etg2, gstart2, gstop2);
    cudaEventElapsedTime(&etc, cstart, cstop);

    printf("GPU total time   = %fs\n", (etg1/(float)1000) );
    printf("GPU compute time = %fs\n", (etg2/(float)1000));
    printf("CPU time         = %fs\n", (etc/(float)1000));
    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}

When I run this on a reasonably fast GPU (a Quadro 5000, a little slower than a Tesla M2050) I get the following:

number of blocks: 32
GPU count = 5592406
CPU count = 5592406
GPU total time   = 0.025714s
GPU compute time = 0.000793s
CPU time         = 0.017332s

We see that the GPU is substantially faster than this (naive, single-threaded) CPU implementation for the compute portion. When we add in the cost of transferring the data, the GPU version is slower, but it is not 30x slower.

By way of comparison, when I timed your original algorithm, I got numbers like this:

GPU total time   = 0.118131s
GPU compute time = 0.093213s

My system configuration for this was a Xeon X5560 CPU, RHEL 5.5, CUDA 5.0, and a Quadro 5000 GPU.
