count3's in CUDA is very slow


Question

I have written a small program in CUDA that counts how many 3's are in a C array and prints the count.

#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#include <cstdlib>

__global__ void incrementArrayOnDevice(int *a, int N, int *count)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    //__shared__ int s_a[512]; // one for each thread
    //s_a[threadIdx.x] = a[id];

    if( id < N )
    {
        //if( s_a[threadIdx.x] == 3 )
        if( a[id] == 3 )
        {
            atomicAdd(count, 1);
        }
    }
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory

    int N = 16777216;

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    // copy data from host to device
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);

    // do calculation on device
    int blockSize = 512;
    int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n", count);

    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}

The result I get is:

real 0m3.025s
user 0m2.989s
sys  0m0.029s

When I run it on the CPU with 4 threads I get:

real 0m0.101s
user 0m0.100s
sys  0m0.024s

Note that the GPU is an old one - I don't know the exact model because I do not have root access to it, but the OpenGL version it reports is 1.2, using the Mesa driver.

Am I doing something wrong? What can I do to make it run faster?

Note: I have tried using buckets for each block (so the atomicAdd()s would be reduced to one counter per block), but I get exactly the same performance. I have also tried copying the 512 integers assigned to each block into a shared memory block (you can see it in the comments) and the time is the same again.
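For reference, the "bucket per block" variant described here might look like the following sketch (hypothetical code, not taken from the question; `countThreesPerBlock` and `blockCounts` are illustrative names). Each block accumulates into its own slot, so atomics from different blocks never contend on the same address, and the host sums the per-block slots afterwards:

```cuda
// Hypothetical sketch of the per-block bucket idea: each block owns one
// slot of blockCounts, so atomicAdd contention is confined to one block.
__global__ void countThreesPerBlock(const int *a, int n, int *blockCounts)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n && a[id] == 3)
        atomicAdd(&blockCounts[blockIdx.x], 1);
}

// Host side (after cudaMalloc/cudaMemset of nBlocks ints and the launch,
// and copying the slots back into h_blockCounts):
//   int total = 0;
//   for (int b = 0; b < nBlocks; ++b) total += h_blockCounts[b];
```

This still serializes all threads of a block on a single counter, which is consistent with the observation that it performed no better.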

Answer

This is in response to your question "What can I do to make it run faster?" As I mentioned in the comments, there are (probably) issues with the timing methodology, and my main suggestion for a speed improvement is to use a classical parallel reduction algorithm. The following code implements a better (in my opinion) timing measurement, and also converts your kernel to a reduction-style kernel:

#include <stdio.h>
#include <assert.h>
#include <cstdlib>


#define N (1<<24)
#define nTPB 512
#define NBLOCKS 32

__global__ void incrementArrayOnDevice(int *a, int n, int *count)
{
  __shared__ int lcnt[nTPB];
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  int lcount = 0;
  while (id < n) {
    if (a[id] == 3) lcount++;
    id += gridDim.x * blockDim.x;
    }
  lcnt[threadIdx.x] = lcount;
  __syncthreads();
  int stride = blockDim.x;
  while(stride > 1) {
    // assume blockDim.x is a power of 2
    stride >>= 1;
    if (threadIdx.x < stride) lcnt[threadIdx.x] += lcnt[threadIdx.x + stride];
    __syncthreads();
    }
  if (threadIdx.x == 0) atomicAdd(count, lcnt[0]);
}

int main(void)
{
    int *a_h;   // host memory
    int *a_d;   // device memory
    cudaEvent_t gstart1,gstart2,gstop1,gstop2,cstart,cstop;
    float etg1, etg2, etc;

    cudaEventCreate(&gstart1);
    cudaEventCreate(&gstart2);
    cudaEventCreate(&gstop1);
    cudaEventCreate(&gstop2);
    cudaEventCreate(&cstart);
    cudaEventCreate(&cstop);

    // allocate array on host
    a_h = (int*)malloc(sizeof(int) * N);
    for(int i = 0; i < N; ++i)
        a_h[i] = (i % 3 == 0 ? 3 : 1);

    // allocate arrays on device
    cudaMalloc(&a_d, sizeof(int) * N);

    int blockSize = nTPB;
    int nBlocks = NBLOCKS;
    printf("number of blocks: %d\n", nBlocks);

    int count;
    int *devCount;
    cudaMalloc(&devCount, sizeof(int));
    cudaMemset(devCount, 0, sizeof(int));

    // copy data from host to device
    cudaEventRecord(gstart1);
    cudaMemcpy(a_d, a_h, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemset(devCount, 0, sizeof(int));
    cudaEventRecord(gstart2);
    // do calculation on device

    incrementArrayOnDevice<<<nBlocks, blockSize>>> (a_d, N, devCount);
    cudaEventRecord(gstop2);

    // retrieve result from device
    cudaMemcpy(&count, devCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(gstop1);

    printf("GPU count = %d\n", count);
    int hostCount = 0;
    cudaEventRecord(cstart);
    for (int i=0; i < N; i++)
      if (a_h[i] == 3) hostCount++;
    cudaEventRecord(cstop);

    printf("CPU count = %d\n", hostCount);
    cudaEventSynchronize(cstop);
    cudaEventElapsedTime(&etg1, gstart1, gstop1);
    cudaEventElapsedTime(&etg2, gstart2, gstop2);
    cudaEventElapsedTime(&etc, cstart, cstop);

    printf("GPU total time   = %fs\n", (etg1/(float)1000) );
    printf("GPU compute time = %fs\n", (etg2/(float)1000));
    printf("CPU time         = %fs\n", (etc/(float)1000));
    free(a_h);
    cudaFree(a_d);
    cudaFree(devCount);
}

When I run this on a reasonably fast GPU (a Quadro 5000, a little slower than a Tesla M2050) I get the following:

number of blocks: 32
GPU count = 5592406
CPU count = 5592406
GPU total time   = 0.025714s
GPU compute time = 0.000793s
CPU time         = 0.017332s

We see that the GPU is substantially faster than this (naive, single-threaded) CPU implementation for the compute portion. When we add in the cost of transferring the data, the GPU version is slower, but it is not 30x slower.

By way of comparison, when I timed your original algorithm, I got numbers like this:

GPU total time   = 0.118131s
GPU compute time = 0.093213s

My system configuration for this was a Xeon X5560 CPU, RHEL 5.5, CUDA 5.0, and a Quadro 5000 GPU.
