GPU gives no performance improvement in Julia set computation


Problem description

I am trying to compare performance on the CPU and the GPU. I have

  • CPU: Intel® Core™ i5 CPU M 480 @ 2.67GHz × 4
  • GPU: NVIDIA GeForce GT 420M

I can confirm that the GPU is configured and works correctly with CUDA.

I am implementing the Julia set computation (http://en.wikipedia.org/wiki/Julia_set). Basically, for every pixel, if the coordinate is in the set it is painted red; otherwise it is painted white.

Although I get identical answers from both the CPU and the GPU, instead of a performance improvement I get a performance penalty when using the GPU.

Run times

  • CPU: 0.052 s
  • GPU: 0.784 s

I am aware that transferring data from the device to the host can take up some time. But still, how do I know whether using the GPU is actually beneficial?

Here is the relevant GPU code:

    #include <stdio.h>
    #include <cuda.h>

    __device__ bool isJulia( float x, float y, float maxX_2, float maxY_2 )
    {
        float z_r = 0.8 * (float) (maxX_2 - x) / maxX_2;
        float z_i = 0.8 * (float) (maxY_2 - y) / maxY_2;

        float c_r = -0.8;
        float c_i = 0.156;
        for( int i=1 ; i<100 ; i++ )
        {
        float tmp_r = z_r*z_r - z_i*z_i + c_r;
        float tmp_i = 2*z_r*z_i + c_i;

        z_r = tmp_r;
        z_i = tmp_i;

        if( sqrt( z_r*z_r + z_i*z_i ) > 1000 )
            return false;
        }
        return true;
    }

    __global__ void kernel( unsigned char * im, int dimx, int dimy )
    {
        //int tid = blockIdx.y*gridDim.x + blockIdx.x;
        int tid = blockIdx.x*blockDim.x + threadIdx.x;
        tid *= 3;
        if( isJulia((float)blockIdx.x, (float)threadIdx.x, (float)dimx/2, (float)dimy/2)==true )
        {
        im[tid] = 255;
        im[tid+1] = 0;
        im[tid+2] = 0;
        }
        else
        {
        im[tid] = 255;
        im[tid+1] = 255;
        im[tid+2] = 255;
        }

    }

    int main()
    {
        int dimx=768, dimy=768;

        //on cpu
        unsigned char * im = (unsigned char*) malloc( 3*dimx*dimy );

        //on GPU
        unsigned char * im_dev;

        //allocate mem on GPU
        cudaMalloc( (void**)&im_dev, 3*dimx*dimy ); 

        //launch kernel. 
        for( int z=0 ; z<10000 ; z++ ) // loop for multiple computations
        {
            kernel<<<dimx,dimy>>>(im_dev, dimx, dimy);
        }

        cudaMemcpy( im, im_dev, 3*dimx*dimy, cudaMemcpyDeviceToHost );

        writePPMImage( im, dimx, dimy, 3, "out_gpu.ppm" ); //assume this writes a ppm file

        free( im );
        cudaFree( im_dev );
    }

Here is the CPU code:

    bool isJulia( float x, float y, float maxX_2, float maxY_2 )
    {
        float z_r = 0.8 * (float) (maxX_2 - x) / maxX_2;
        float z_i = 0.8 * (float) (maxY_2 - y) / maxY_2;

        float c_r = -0.8;
        float c_i = 0.156;
        for( int i=1 ; i<100 ; i++ )
        {
        float tmp_r = z_r*z_r - z_i*z_i + c_r;
        float tmp_i = 2*z_r*z_i + c_i;

        z_r = tmp_r;
        z_i = tmp_i;

        if( sqrt( z_r*z_r + z_i*z_i ) > 1000 )
            return false;
        }
        return true;
    }


    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
      const int dimx = 768, dimy = 768;
      int i, j;

      unsigned char * data = new unsigned char[dimx*dimy*3];

      for( int z=0 ; z<10000 ; z++ ) // loop for multiple computations
      {
      for (j = 0; j < dimy; ++j)
      {
        for (i = 0; i < dimx; ++i)
        {
          if( isJulia(i,j,dimx/2,dimy/2) == true )
          {
          data[3*j*dimx + 3*i + 0] = (unsigned char)255;  /* red */
          data[3*j*dimx + 3*i + 1] = (unsigned char)0;  /* green */
          data[3*j*dimx + 3*i + 2] = (unsigned char)0;  /* blue */
          }
          else
          {
          data[3*j*dimx + 3*i + 0] = (unsigned char)255;  /* red */
          data[3*j*dimx + 3*i + 1] = (unsigned char)255;  /* green */
          data[3*j*dimx + 3*i + 2] = (unsigned char)255;  /* blue */
          }
        }
      }
      }

      writePPMImage( data, dimx, dimy, 3, "out_cpu.ppm" ); //assume this writes a ppm file
      delete [] data;


      return 0;
    }

Further, following suggestions from @hyde, I have looped the computation-only part to generate 10,000 images. I am not bothering to write all those images, though; the computation alone is what I am timing.

Here are the run times:

  • CPU: more than 10 minutes, and the code was still running
  • GPU: 1m 14.765s

Accepted answer

Turning comments into an answer:

To get relevant figures, you need to calculate more than one image, so that execution time is at least seconds or tens of seconds. Also, including file-saving time in the results adds noise and hides the actual CPU vs GPU difference.

Another way to get real results is to select a Julia set that has a lot of points belonging to the set, then raise the iteration count so high that it takes many seconds to calculate just one image. Then there is only one single calculation setup, so this is likely to be the most advantageous scenario for GPU/CUDA.

To measure how much overhead there is, change the image size to 1x1 and the iteration limit to 1, and then calculate enough images that it takes at least a few seconds. In this scenario, the GPU is likely to be significantly slower.

To get the most relevant timings for your use case, select the image size and iteration count you are really going to use, and then measure the image count at which both versions are equally fast. That will give you a rough rule of thumb to decide which you should use when.

An alternative approach for practical results, if you are going to compute just one image: find the iteration limit for a single worst-case image at which the CPU and GPU are equally fast. If that many or more iterations would be advantageous, choose the GPU; otherwise choose the CPU.
