Stream scheduling order


Problem description

As I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?

allOfData_A = data_A1 + data_A2
allOfData_B = data_B1 + data_B2
allOfData_C = data_C1 + data_C2
Data_C is the output of a kernel that operates on both Data_A and Data_B (like C = A + B).
The hardware supports one DeviceOverlap (concurrent copy and kernel execution) operation.

Process One:

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B1 stream1 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream1
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H

Process Two: (Same operation, different order)

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_B1 stream1 H->D
sameKernel stream1
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H

Solution

Using CUDA streams allows the programmer to express work dependencies by putting dependent operations in the same stream. Work in different streams is independent and can be executed concurrently.

On GPUs without HyperQ (compute capability 1.0 to 3.0) you can get false dependencies, because all the work for a DMA engine, or for computation, is put into a single hardware pipe, so an operation can end up waiting behind unrelated work from another stream that was issued ahead of it. Compute capability 3.5 brings HyperQ, which allows multiple hardware pipes, so there you shouldn't get the false dependencies. The simpleHyperQ example illustrates this, and the documentation shows diagrams that explain what is going on more clearly.
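
To see which case applies to a given card, you can query the device properties. The sketch below is only illustrative: likelyHasHyperQ is a hypothetical helper that encodes the compute capability 3.5 threshold mentioned above (the runtime has no dedicated HyperQ flag that I know of), and asyncEngineCount is the field that supersedes the older deviceOverlap flag from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: applies the "compute capability 3.5 brings HyperQ"
// rule of thumb from the answer. It is a heuristic, not a runtime query.
bool likelyHasHyperQ(int major, int minor) {
    return major > 3 || (major == 3 && minor >= 5);
}

int main() {
    const int device = 0;  // assumption: inspect the first device
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device: %s\n", prop.name);
    std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    // Number of copy engines that can run alongside kernel execution.
    std::printf("Async copy engines: %d\n", prop.asyncEngineCount);
    // Non-zero if kernels from different streams may execute concurrently.
    std::printf("Concurrent kernels: %d\n", prop.concurrentKernels);
    std::printf("HyperQ expected (by compute capability): %s\n",
                likelyHasHyperQ(prop.major, prop.minor) ? "yes" : "no");
    return 0;
}
```

If the last line prints "no", the issue order of your copies and kernels is more likely to matter.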

Putting it simply, on devices without HyperQ you would need to do a breadth-first launch of your work (interleaving the streams, as Process One does) to get maximum concurrency, whereas on devices with HyperQ you can do a depth-first launch (issuing one stream's work before moving on to the next). Avoiding the false dependencies is pretty easy, but not having to worry about it is easier!
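
If you want to measure the two orderings from the question on your own hardware, something along these lines should work. It is a minimal sketch under my own assumptions: addKernel stands in for sameKernel, the Chunk struct, the helper functions and the buffer size are made up for illustration, and the wall-clock timing is deliberately crude. Which order comes out faster, if either, depends on the device.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
void check(cudaError_t err) {
    if (err == cudaSuccess) return;
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::exit(EXIT_FAILURE);
}

// Stand-in for "sameKernel" in the question: C = A + B.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper type: one half of the data and the stream that owns it.
struct Chunk {
    cudaStream_t stream;
    float *hA, *hB, *hC;  // pinned host buffers (needed for truly async copies)
    float *dA, *dB, *dC;  // device buffers
};

constexpr int kChunks = 2;
constexpr int kN = 1 << 22;  // elements per chunk (arbitrary)
constexpr size_t kBytes = kN * sizeof(float);
constexpr cudaMemcpyKind kH2D = cudaMemcpyHostToDevice, kD2H = cudaMemcpyDeviceToHost;

void copyAsync(float* dst, const float* src, cudaMemcpyKind kind, cudaStream_t s) {
    check(cudaMemcpyAsync(dst, src, kBytes, kind, s));
}
void launch(const Chunk& c) {
    addKernel<<<(kN + 255) / 256, 256, 0, c.stream>>>(c.dA, c.dB, c.dC, kN);
    check(cudaGetLastError());
}

// Process One: breadth-first, alternating between the streams at every step.
void issueProcessOne(Chunk* c) {
    for (int i = 0; i < kChunks; ++i) copyAsync(c[i].dA, c[i].hA, kH2D, c[i].stream);
    for (int i = 0; i < kChunks; ++i) copyAsync(c[i].dB, c[i].hB, kH2D, c[i].stream);
    for (int i = 0; i < kChunks; ++i) launch(c[i]);
    for (int i = 0; i < kChunks; ++i) copyAsync(c[i].hC, c[i].dC, kD2H, c[i].stream);
}

// Process Two: uploads and kernel issued stream by stream, downloads last.
void issueProcessTwo(Chunk* c) {
    for (int i = 0; i < kChunks; ++i) {
        copyAsync(c[i].dA, c[i].hA, kH2D, c[i].stream);
        copyAsync(c[i].dB, c[i].hB, kH2D, c[i].stream);
        launch(c[i]);
    }
    for (int i = 0; i < kChunks; ++i) copyAsync(c[i].hC, c[i].dC, kD2H, c[i].stream);
}

// Creates the streams and buffers; inputs are A = 1 and B = 2 everywhere.
void allocChunks(Chunk* c) {
    for (int i = 0; i < kChunks; ++i) {
        check(cudaStreamCreate(&c[i].stream));
        float** host[] = {&c[i].hA, &c[i].hB, &c[i].hC};
        float** dev[] = {&c[i].dA, &c[i].dB, &c[i].dC};
        for (int k = 0; k < 3; ++k) {
            check(cudaMallocHost(host[k], kBytes));
            check(cudaMalloc(dev[k], kBytes));
        }
        for (int j = 0; j < kN; ++j) { c[i].hA[j] = 1.0f; c[i].hB[j] = 2.0f; c[i].hC[j] = 0.0f; }
    }
}

// Issues one schedule, waits for it to finish, returns wall-clock milliseconds.
double timeMs(void (*issue)(Chunk*), Chunk* c) {
    check(cudaDeviceSynchronize());
    auto t0 = std::chrono::steady_clock::now();
    issue(c);
    check(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    Chunk c[kChunks];
    allocChunks(c);
    timeMs(issueProcessOne, c);  // warm-up pass, timing discarded
    double one = timeMs(issueProcessOne, c);
    double two = timeMs(issueProcessTwo, c);
    std::printf("Process One: %.3f ms\n", one);
    std::printf("Process Two: %.3f ms\n", two);
    std::printf("Spot check: C1[0] = %.1f, C2[last] = %.1f\n", c[0].hC[0], c[1].hC[kN - 1]);
    for (Chunk& k : c) {
        cudaFreeHost(k.hA); cudaFreeHost(k.hB); cudaFreeHost(k.hC);
        cudaFree(k.dA); cudaFree(k.dB); cudaFree(k.dC);
        cudaStreamDestroy(k.stream);
    }
    return 0;
}
```

A single run is noisy, so repeat the measurement a few times, or look at the timeline in a profiler, before drawing conclusions about your device.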
