128-bit vector addition with CUDA, performance issue

Question

I want to add 128-bit vectors with carry. My 128-bit version (addKernel128 in the code below) is twice as slow as the basic 32-bit version (addKernel32 below). Do I have memory coalescing problems? How can I get better performance?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cstdio> // printf is used below; <iostream> was not needed

#define UADDO(c, a, b) asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
#define UADDC(c, a, b) asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));

__global__ void addKernel32(unsigned int *c, const unsigned int *a, const unsigned int *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  while (tid < size)
  {
    c[tid] = a[tid] + b[tid];
    tid += blockDim.x * gridDim.x;
  }
}

__global__ void addKernel128(unsigned *c, const unsigned *a, const unsigned *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  while (tid < size / 4)
  {
    uint4 a4 = ((const uint4 *)a)[tid],
          b4 = ((const uint4 *)b)[tid],
          c4;

    UADDO(c4.x, a4.x, b4.x)
    UADDC(c4.y, a4.y, b4.y) // add with carry
    UADDC(c4.z, a4.z, b4.z) // add with carry
    UADDC(c4.w, a4.w, b4.w) // add with carry (no overflow checking for clarity)

    ((uint4 *)c)[tid] = c4;

    tid += blockDim.x * gridDim.x;
  }
}

int main()
{
  const int size = 10000000; // 10 million

  unsigned int *d_a, *d_b, *d_c;

  cudaMalloc((void**)&d_a, size * sizeof(int));
  cudaMalloc((void**)&d_b, size * sizeof(int));
  cudaMalloc((void**)&d_c, size * sizeof(int));

  cudaMemset(d_a, 1, size * sizeof(int)); // dummy init just for the example
  cudaMemset(d_b, 2, size * sizeof(int)); // dummy init just for the example
  cudaMemset(d_c, 0, size * sizeof(int));

  int nbThreads = 512;
  int nbBlocks = 1024; // for example

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);

  addKernel128<<<nbBlocks, nbThreads>>>(d_c, d_a, d_b, size);

  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float m = 0;
  cudaEventElapsedTime(&m, start, stop);

  cudaFree(d_c);
  cudaFree(d_b);
  cudaFree(d_a);
  cudaDeviceReset();
  printf("Elapsed = %g\n", m);
  return 0;
}

Recommended answer

Timing CUDA code on a WDDM GPU can be quite difficult for a variety of reasons. Most of these revolve around the fact that the GPU is being managed as a display device by Windows, and this can introduce a variety of artifacts into the timing. One example is that the Windows driver and WDDM will batch work for the GPU, and may interleave display work in the middle of CUDA GPU work.

  • If possible, time your CUDA code on Linux, or else on a Windows GPU in TCC mode.
  • For performance, always build without the -G switch. In Visual Studio, this usually corresponds to building the release version of the project, not the debug version.
  • To get a good performance comparison, it's usually advisable to do some "warm-up runs" before actually measuring the timing results. These will eliminate "start-up" and other one-time measurement issues, and you are more likely to get sensible results. You may also wish to run your code a number of times and average the results.
  • It's also usually advisable to compile with an arch flag that corresponds to your GPU, for example -arch=sm_20 for a cc2.0 GPU.
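
The warm-up and averaging advice can be sketched roughly as follows. This is a minimal illustration, not a definitive benchmark: the run counts (warmupRuns, timedRuns), the helper effectiveBandwidthGBs and the file name in the build comment are assumptions made for the example, and the kernel is the addKernel32 from the question. The bandwidth figure simply assumes each launch reads a and b once and writes c once.

```cuda
// Build without -G and with an arch flag matching your GPU, for example:
//   nvcc -arch=sm_20 timing.cu   (sm_20 targets a cc2.0 GPU; the file name is illustrative)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addKernel32(unsigned int *c, const unsigned int *a,
                            const unsigned int *b, const int size)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  while (tid < size)
  {
    c[tid] = a[tid] + b[tid];
    tid += blockDim.x * gridDim.x;
  }
}

// Hypothetical host-side helper: bytes moved divided by elapsed time, in GB/s.
double effectiveBandwidthGBs(double bytes, float ms)
{
  return bytes / (ms * 1.0e6); // bytes / (ms * 1e-3 s) / 1e9
}

int main()
{
  const int size = 10000000;
  const int warmupRuns = 3;  // assumed value, chosen only for illustration
  const int timedRuns = 20;  // assumed value, chosen only for illustration
  const int nbThreads = 512, nbBlocks = 1024;
  const size_t bytes = size * sizeof(unsigned int);

  unsigned int *d_a, *d_b, *d_c;
  cudaMalloc((void**)&d_a, bytes);
  cudaMalloc((void**)&d_b, bytes);
  cudaMalloc((void**)&d_c, bytes);
  cudaMemset(d_a, 1, bytes);
  cudaMemset(d_b, 2, bytes);

  // Warm-up launches: absorb start-up and other one-time costs before timing.
  for (int i = 0; i < warmupRuns; ++i)
    addKernel32<<<nbBlocks, nbThreads>>>(d_c, d_a, d_b, size);
  cudaDeviceSynchronize();

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // Time several launches together and report the average per launch.
  cudaEventRecord(start);
  for (int i = 0; i < timedRuns; ++i)
    addKernel32<<<nbBlocks, nbThreads>>>(d_c, d_a, d_b, size);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float totalMs = 0;
  cudaEventElapsedTime(&totalMs, start, stop);

  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
  {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  const float avgMs = totalMs / timedRuns;
  printf("Average = %g ms, about %g GB/s\n", avgMs,
         effectiveBandwidthGBs(3.0 * bytes, avgMs)); // two reads + one write

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_c);
  cudaFree(d_b);
  cudaFree(d_a);
  return 0;
}
```

Running the same harness with addKernel128 substituted in, on the same build settings, should give a fairer comparison between the two kernels than a single cold launch of each.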