Finding minimum in GPU slower than CPU


Problem Description



I have implemented this code: http://www.cuvilib.com/Reduction.pdf in order to calculate the sum of the elements of a matrix.

However, it runs much slower on the GPU than on the CPU.

I have an i7 processor and an NVIDIA GT 540M graphics card.

Is it supposed to be that way, or is something else going on?

EDIT: I use version 3 of the code above on Ubuntu 13.04 and compile it with Eclipse Nsight. The matrix has 2097152 elements. The GPU version executes in 3.6 ms, whereas the CPU version takes around 1.0 ms. Below is the whole code:

#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <sys/time.h>
#include <omp.h>
#include <iostream>
#include <algorithm>

#define MIN(a,b) (((a)<(b))?(a):(b))



static const int WORK_SIZE = 2097152;



int find_min(int *a, int length){
    int min = a[0];
    for (int i = 1; i < length; i++)
        if (a[i] < min)
            min = a[i];
    return min;
}


__global__ static void red_min(int *g_idata,int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid]= g_idata[i];
    __syncthreads();

    for(unsigned int s=blockDim.x/2; s > 0; s >>= 1) {
        if (tid<s) {
            sdata[tid] = MIN(sdata[tid],sdata[tid + s]);
        }
        __syncthreads();
    }
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}





int main(void) {
    int *d1,*d2;
    int i,*result;
    int *idata,*fdata;
    srand ( time(NULL) );
    result = (int *)malloc(sizeof(int));
    idata = (int *)malloc(WORK_SIZE*sizeof(int));
    fdata = (int *)malloc(WORK_SIZE*sizeof(int));
    cudaMalloc((int**)&d1,WORK_SIZE*sizeof(int));
    cudaMalloc((int**)&d2,WORK_SIZE*sizeof(int));


    for (i = 0; i < WORK_SIZE; i++){
       idata[i] = rand();
       fdata[i] = i;
    }
    struct timeval begin, end;
    gettimeofday(&begin, NULL);
    *result = find_min(idata,WORK_SIZE);
    printf( "Minimum Element CPU: %d \n", *result);
    gettimeofday(&end, NULL);
    int time = (end.tv_sec * (unsigned int)1e6 + end.tv_usec) - (begin.tv_sec * (unsigned int)1e6 + begin.tv_usec);
    printf("Microseconds elapsed CPU: %d\n", time);

    cudaMemcpy(d1,idata,WORK_SIZE*sizeof(int),cudaMemcpyHostToDevice);



    cudaEvent_t start, stop;
    cudaEventCreate( &start);
    cudaEventCreate( &stop);
    cudaEventRecord(start,0);
    int num_blocks = 16384;
    bool flag = true;
    while (num_blocks>0){
        if (flag) {
            red_min<<<num_blocks,128,128*sizeof(int)>>>(d1,d2);
        }
        else {
            red_min<<<num_blocks,128,128*sizeof(int)>>>(d2,d1);
        }
        num_blocks /= 128;
        flag = !flag;
}

Solution

GT540M is a mobile GPU, so I assume you're running on a laptop, and furthermore you may be hosting the X display on the 540M GPU.
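One quick way to check whether the 540M is also driving a display (a minimal sketch, not part of your code; it just uses the standard CUDA runtime device-property query) is to look at the kernel run-time limit flag, which is normally set when a display watchdog is active on the GPU:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 (assumed to be the GT540M here)
    printf("Device: %s, SMs: %d\n", prop.name, prop.multiProcessorCount);
    // A nonzero run-time limit usually means a display watchdog is active,
    // i.e. the GPU is also hosting a display.
    printf("Kernel run-time limit (display watchdog): %s\n",
           prop.kernelExecTimeoutEnabled ? "yes" : "no");
    return 0;
}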

I built a complete version of your code:

#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <sys/time.h>
#include <omp.h>
#include <iostream>
#include <algorithm>

#define MIN(a,b) (((a)<(b))?(a):(b))



static const int WORK_SIZE = 2097152;



int find_min(int *a, int length){
    int min = a[0];
    for (int i = 1; i < length; i++)
        if (a[i] < min)
            min = a[i];
    return min;
}


__global__ static void red_min(int *g_idata,int *g_odata) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid]= g_idata[i];
    __syncthreads();

    for(unsigned int s=blockDim.x/2; s > 0; s >>= 1) {
        if (tid<s) {
            sdata[tid] = MIN(sdata[tid],sdata[tid + s]);
        }
        __syncthreads();
    }
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}





int main(void) {
    int *d1,*d2;
    int i,*result;
    int *idata,*fdata;
    srand ( time(NULL) );
    result = (int *)malloc(sizeof(int));
    idata = (int *)malloc(WORK_SIZE*sizeof(int));
    fdata = (int *)malloc(WORK_SIZE*sizeof(int));
    cudaMalloc((int**)&d1,WORK_SIZE*sizeof(int));
    cudaMalloc((int**)&d2,WORK_SIZE*sizeof(int));


    for (i = 0; i < WORK_SIZE; i++){
       idata[i] = rand();
       fdata[i] = i;
    }
    struct timeval begin, end;
    gettimeofday(&begin, NULL);
    *result = find_min(idata,WORK_SIZE);
    printf( "Minimum Element CPU: %d \n", *result);
    gettimeofday(&end, NULL);
    int time = (end.tv_sec * (unsigned int)1e6 + end.tv_usec) - (begin.tv_sec * (unsigned int)1e6 + begin.tv_usec);
    printf("Microseconds elapsed CPU: %d\n", time);

    cudaMemcpy(d1,idata,WORK_SIZE*sizeof(int),cudaMemcpyHostToDevice);



    cudaEvent_t start, stop;
    cudaEventCreate( &start);
    cudaEventCreate( &stop);
    cudaEventRecord(start,0);
    int num_blocks = 16384;
    bool flag = true;
    int loops = 0;
    while (num_blocks>0){
        if (flag) {
            red_min<<<num_blocks,128,128*sizeof(int)>>>(d1,d2);
        }
        else {
            red_min<<<num_blocks,128,128*sizeof(int)>>>(d2,d1);
        }
        num_blocks /= 128;
        flag = !flag;
        loops++;
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float et = 0.0f;
    cudaEventElapsedTime(&et, start, stop);
    printf("GPU time: %fms, in %d loops\n", et, loops);
    int gpuresult;
    if (flag)
      cudaMemcpy(&gpuresult, d1, sizeof(int), cudaMemcpyDeviceToHost);
    else
      cudaMemcpy(&gpuresult, d2, sizeof(int), cudaMemcpyDeviceToHost);
    printf("GPU min: %d\n", gpuresult);
    return 0;
}

compiled it:

$ nvcc -O3 -arch=sm_20 -o t264 t264.cu

and ran it on an M2050 GPU (RHEL 5.5, CUDA 5.5, Xeon X5650 CPU):

$ ./t264
Minimum Element CPU: 288
Microseconds elapsed CPU: 1217
GPU time: 0.621408ms, in 3 loops
GPU min: 288
$

So my CPU results were pretty close to yours, but my GPU results were about 5-6x faster. If we compare the M2050 to your GT540M, we see that the M2050 has 14 SMs whereas the GT540M has 2. More importantly, the M2050 has about 5x the memory bandwidth of your GT540M GPU (28.8 GB/s peak theoretical for the GT540M vs. ~150 GB/s peak theoretical for the M2050).

Since a well-written parallel reduction is memory-bandwidth-bound code on GPUs, the speed difference between your GPU and mine makes sense.
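To put rough numbers on that (a back-of-the-envelope sketch using the element count and peak bandwidth figures quoted above, not a measurement): the first reduction pass has to read all 2097152 ints from global memory, so peak bandwidth alone sets a floor on the kernel time:

#include <stdio.h>

int main(void) {
    const double bytes     = 2097152.0 * sizeof(int); // ~8.4 MB read by the first pass
    const double bw_gt540m = 28.8e9;                   // 28.8 GB/s peak theoretical (GT540M)
    const double bw_m2050  = 150.0e9;                  // ~150 GB/s peak theoretical (M2050)
    printf("GT540M bandwidth floor: %.3f ms\n", bytes / bw_gt540m * 1e3); // ~0.29 ms
    printf("M2050  bandwidth floor: %.3f ms\n", bytes / bw_m2050  * 1e3); // ~0.06 ms
    return 0;
}

Neither GPU reaches its theoretical peak in practice, but the ratio of those floors tracks the roughly 5x bandwidth ratio between the two cards, which is consistent with the measured gap.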

So I would say your results are probably about what is expected, and to get better results you will probably need a faster GPU.

Also, if your GT540M is hosting an X display, it's possible that the GPU timing is being corrupted by display activity. When timing a single kernel this is not normally an issue - the kernel execution briefly interrupts display processing. But when timing a sequence of kernels in succession, display tasks can jump in and execute between the kernel calls (the GPU multitasks when it is asked to both drive a display and run CUDA code). Therefore, this may be affecting your timings as well.
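If you want to rule that out, one option (a sketch modifying the timed loop in the program above, not something that was benchmarked here) is to time each launch with its own event pair and sum only the kernels' execution times, so that any work scheduled between launches is excluded:

// Replaces the event-timed while loop above; the trailing
// cudaEventRecord/cudaEventElapsedTime block is then no longer needed.
float total_ms = 0.0f;
while (num_blocks > 0) {
    cudaEventRecord(start, 0);
    if (flag)
        red_min<<<num_blocks, 128, 128*sizeof(int)>>>(d1, d2);
    else
        red_min<<<num_blocks, 128, 128*sizeof(int)>>>(d2, d1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float et = 0.0f;
    cudaEventElapsedTime(&et, start, stop);
    total_ms += et;              // accumulate only this kernel's time
    num_blocks /= 128;
    flag = !flag;
    loops++;
}
printf("GPU kernel-only time: %f ms, in %d loops\n", total_ms, loops);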
