Simulating pipeline program with CUDA


Problem Description

Say I have two arrays A and B and a kernel1 that does some calculation on both arrays (vector addition, for example) by breaking the arrays into different chunks and writing the partial results to C. kernel1 keeps doing this until all elements in the arrays are processed.

// grid-stride loop: each thread strides through the arrays one grid-width at a time
// (element type assumed float for illustration)
__global__ void kernel1(const float *A, const float *B, float *C, unsigned int N)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x*gridDim.x;

    // iterate through each chunk of gridSize in both A and B
    while (i < N) {
        C[i] = A[i] + B[i];
        i += gridSize;
    }
}

Say, now I want to launch a kernel2 on C and another data array D. Is there any way I can start kernel2 immediately after the first chunk of C is calculated? In essence, kernel1 pipes its result to kernel2. The dependency tree would look like this:

       Result
       /  \
      C    D
    /  \    
   A    B     

I have thought about using CUDA streams but am not sure exactly how. Maybe by incorporating the host in the calculation?

Answer

Yes, you could use CUDA streams to manage order and dependencies in such a scenario.

Let's assume that you will want to overlap the copy and compute operations. This typically implies that you will break your input data into "chunks" and you will copy chunks to the device, then launch compute operations. Each kernel launch operates on a "chunk" of data.

We could manage the process with a loop in host code:

// create streams and ping-pong pointer
cudaStream_t stream1, stream2, *st_ptr;
cudaStreamCreate(&stream1); cudaStreamCreate(&stream2);
// assume D is already on device as dev_D
// note: A and B must be in pinned (page-locked) host memory for
// cudaMemcpyAsync to be truly asynchronous and overlap with kernels
for (int chunkid = 0; chunkid < max; chunkid++){
  //ping-pong streams
  st_ptr = (chunkid % 2)?(&stream1):(&stream2);
  size_t offset = chunkid*chunk_size;
  //copy A and B chunks
  cudaMemcpyAsync(dev_A+offset, A+offset, chunk_size*sizeof(A_type), cudaMemcpyHostToDevice, *st_ptr);
  cudaMemcpyAsync(dev_B+offset, B+offset, chunk_size*sizeof(B_type), cudaMemcpyHostToDevice, *st_ptr);
  // then compute C based on A and B
  compute_C_kernel<<<...,*st_ptr>>>(dev_C+offset, dev_A+offset, dev_B+offset, chunk_size);
  // then compute Result based on C and D
  compute_Result_kernel<<<...,*st_ptr>>>(dev_C+offset, dev_D, chunk_size);
  // could copy a chunk of Result back to host here with cudaMemcpyAsync on same stream
  }
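
For the async copies above to actually overlap with kernel execution, the host arrays must be allocated in pinned (page-locked) memory. Here is a minimal sketch of the surrounding setup and teardown, assuming a float element type and the hypothetical names N and num_chunks (neither is defined in the original code):

// setup before the chunk loop (float element type and N/num_chunks are
// illustrative assumptions, not part of the original answer)
size_t chunk_size = N / num_chunks;      // assume N divides evenly
float *A, *B;                            // pinned host buffers
cudaHostAlloc((void**)&A, N*sizeof(float), cudaHostAllocDefault);
cudaHostAlloc((void**)&B, N*sizeof(float), cudaHostAllocDefault);
float *dev_A, *dev_B, *dev_C;
cudaMalloc(&dev_A, N*sizeof(float));
cudaMalloc(&dev_B, N*sizeof(float));
cudaMalloc(&dev_C, N*sizeof(float));
// ... dev_D and a Result buffer would be allocated similarly ...

// ... the chunk loop above runs here ...

// teardown after the loop
cudaDeviceSynchronize();                 // wait for both streams to drain
cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);
cudaFreeHost(A); cudaFreeHost(B);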

All operations issued to the same stream are guaranteed to execute in order (i.e. sequentially) on the device. Operations issued to separate streams can overlap. Therefore the above sequence should:


  • copy a chunk of A to the device
  • copy a chunk of B to the device
  • launch a kernel to compute C from A and B
  • launch a kernel to compute Result from C and D

The above steps will be repeated for each chunk, but successive chunk operations will be issued to alternate streams. Therefore the copy operations of chunk 2 can overlap with the kernel operations from chunk 1, etc.
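
If you instead wanted the two kernels for a given chunk in different streams, rather than relying on same-stream ordering, the dependency could be expressed explicitly with a CUDA event. A hedged sketch for one chunk (grid and block are hypothetical launch parameters):

// alternative sketch: express the C -> Result dependency across two
// streams with an event instead of same-stream ordering
cudaEvent_t c_ready;
cudaEventCreateWithFlags(&c_ready, cudaEventDisableTiming);
compute_C_kernel<<<grid, block, 0, stream1>>>(dev_C+offset, dev_A+offset, dev_B+offset, chunk_size);
cudaEventRecord(c_ready, stream1);        // mark this chunk of C as complete
cudaStreamWaitEvent(stream2, c_ready, 0); // stream2 blocks only on that event
compute_Result_kernel<<<grid, block, 0, stream2>>>(dev_C+offset, dev_D, chunk_size);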

You can learn more by reviewing a presentation on CUDA streams. Here is one example.

Newer devices (Kepler and Maxwell) should be fairly flexible about the program issue order needed to witness overlap of operations on the device. Older (Fermi) devices may be sensitive to issue order. You can read more about that here.
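
On those older devices, one common workaround is to issue work breadth-first (all copies across the streams, then all kernels) rather than depth-first. A sketch of the reordered loop, reusing the names from the code above (grid and block are hypothetical launch parameters):

// breadth-first issue order, which can help Fermi-class devices with
// fewer hardware queues avoid false serialization between streams
for (int chunkid = 0; chunkid < max; chunkid++){  // issue all copies first
  st_ptr = (chunkid % 2)?(&stream1):(&stream2);
  size_t offset = chunkid*chunk_size;
  cudaMemcpyAsync(dev_A+offset, A+offset, chunk_size*sizeof(A_type), cudaMemcpyHostToDevice, *st_ptr);
  cudaMemcpyAsync(dev_B+offset, B+offset, chunk_size*sizeof(B_type), cudaMemcpyHostToDevice, *st_ptr);
}
for (int chunkid = 0; chunkid < max; chunkid++){  // then issue all kernels
  st_ptr = (chunkid % 2)?(&stream1):(&stream2);
  size_t offset = chunkid*chunk_size;
  compute_C_kernel<<<grid, block, 0, *st_ptr>>>(dev_C+offset, dev_A+offset, dev_B+offset, chunk_size);
  compute_Result_kernel<<<grid, block, 0, *st_ptr>>>(dev_C+offset, dev_D, chunk_size);
}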

