128-bit vector addition with CUDA, performance issue


Question

I want to add 128-bit vectors with carry. My 128-bit version (addKernel128 in the code below) is twice as slow as the basic 32-bit version (addKernel32 below). Do I have memory coalescing problems? How can I get better performance?
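The per-element work in addKernel128 is a four-limb add-with-carry chain. As a point of reference, here is a minimal host-side C++ sketch of the same operation (limb names x..w mirror uint4, least-significant limb first; the final carry out is dropped, as in the kernel):

```cpp
#include <cstdint>

// 128-bit value as four 32-bit limbs, x = least significant (mirrors uint4).
struct u128 { uint32_t x, y, z, w; };

// Host-side equivalent of the UADDO/UADDC chain: each 64-bit intermediate
// sum holds the 32-bit result in its low half and the carry in bit 32.
u128 add128(u128 a, u128 b)
{
    u128 c;
    uint64_t s = (uint64_t)a.x + b.x;        c.x = (uint32_t)s;
    s = (uint64_t)a.y + b.y + (s >> 32);     c.y = (uint32_t)s;
    s = (uint64_t)a.z + b.z + (s >> 32);     c.z = (uint32_t)s;
    s = (uint64_t)a.w + b.w + (s >> 32);     c.w = (uint32_t)s; // carry out dropped
    return c;
}
```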

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio> // for printf

#define UADDO(c, a, b) asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define UADDC(c, a, b) asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));

__global__ void addKernel32(unsigned int *c, const unsigned int *a, const unsigned int *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  while (tid < size)
  {
    c[tid] = a[tid] + b[tid];
    tid += blockDim.x * gridDim.x;
  }
}

__global__ void addKernel128(unsigned *c, const unsigned *a, const unsigned *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  while (tid < size / 4)
  {
    uint4 a4 = ((const uint4 *)a)[tid],
          b4 = ((const uint4 *)b)[tid],
          c4;

    UADDO(c4.x, a4.x, b4.x)
    UADDC(c4.y, a4.y, b4.y) // add with carry
    UADDC(c4.z, a4.z, b4.z) // add with carry
    UADDC(c4.w, a4.w, b4.w) // add with carry (no overflow checking for clarity)

    ((uint4 *)c)[tid] = c4;

    tid += blockDim.x * gridDim.x;
  }
}

int main()
{
  const int size = 10000000; // 10 million

  unsigned int *d_a, *d_b, *d_c;

  cudaMalloc((void**)&d_a, size * sizeof(int));
  cudaMalloc((void**)&d_b, size * sizeof(int));
  cudaMalloc((void**)&d_c, size * sizeof(int));

  cudaMemset(d_a, 1, size * sizeof(int)); // dummy init just for the example
  cudaMemset(d_b, 2, size * sizeof(int)); // dummy init just for the example
  cudaMemset(d_c, 0, size * sizeof(int));

  int nbThreads = 512;
  int nbBlocks = 1024; // for example

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);

  addKernel128<<<nbBlocks, nbThreads>>>(d_c, d_a, d_b, size);

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float m = 0;
  cudaEventElapsedTime(&m, start, stop);

  cudaFree(d_c);
  cudaFree(d_b);
  cudaFree(d_a);
  cudaDeviceReset();
  printf("Elapsed = %g\n", m);
  return 0;
}

Answer


Timing CUDA code on a WDDM GPU can be quite difficult for a variety of reasons. Most of these revolve around the fact that the GPU is being managed as a display device by Windows, and this can introduce a variety of artifacts into the timing. One example is that the Windows driver and WDDM will batch work for the GPU, and may interleave display work in the middle of CUDA GPU work.

  • If possible, time your CUDA code on Linux, or else on a Windows GPU in TCC mode.
  • For performance, always build without the -G switch. In Visual Studio, this usually corresponds to building the release, not the debug, version of the project.
  • To get a good performance comparison, it's usually advisable to do some "warm-up runs" before actually measuring the timing results. These will eliminate "start-up" and other one-time measurement issues, and you are more likely to get sensible results. You may also wish to run your code a number of times and average the results.
  • It's also usually advisable to compile with an arch flag that corresponds to your GPU, for example -arch=sm_20 for a cc2.0 GPU.
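The warm-up-and-average pattern above can be sketched as follows. This is a host-side illustration using std::chrono with a placeholder `work` callable; for the actual kernel, you would substitute the launch plus cudaEvent timing (or a cudaDeviceSynchronize before reading the clock), as in the question's code:

```cpp
#include <chrono>

// Run `work` a few times untimed to absorb start-up and one-time costs,
// then time `reps` runs and return the average in milliseconds.
template <typename F>
double time_avg_ms(F work, int warmups = 3, int reps = 10)
{
    for (int i = 0; i < warmups; ++i)
        work(); // warm-up runs: results discarded

    double total_ms = 0.0;
    for (int i = 0; i < reps; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / reps; // average over measured runs
}
```

Averaging several measured runs after the warm-ups smooths out driver batching and other one-off effects that a single timed launch would capture.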
