Numba and guvectorize for CUDA target: Code running slower than expected


Problem Description

Notable details

  • Large datasets (10 million x 5), (200 x 10 million x 5)
  • Numpy mostly
  • Takes longer after every run
  • Using Spyder3
  • Windows 10

The first thing I tried was using guvectorize with the following function. I am passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error:

File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", >line 82, in jitwrapper debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'

After following all that I could find from this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing but oh well. It should also be said that this error only occurs after running it once and having it crash due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], 
int64, int64, float64[:,:,:])'], '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)', 
target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Attempting to run the code with nvprofiler in command line results in the following error:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI-enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device id), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit execution to the other card, but am left with the same results.

I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.
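
For reference, an njit(parallel=True) version of the same loop — the kind of CPU baseline being compared against here — might look roughly like the following sketch (the function name is just illustrative; this is not the exact code used for the comparison):

from numba import njit, prange

@njit(parallel=True)
def cVestDiscountCPU(multBy, discount, cv):   # illustrative name, not the original function
    countRow, countCol, n = cv.shape
    for as_of_date in prange(countRow):       # outer loop parallelized across CPU threads
        for ID in range(countCol):
            for num in range(n):
                cv[as_of_date, ID, num] = multBy[ID, num] * discount[as_of_date, ID, num]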

Why is cuda so much slower, and what do these errors mean?

Thanks for the help!

EDIT: Here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I am able to run the code in cuda using a gtx 1080ti, however, it is much slower than running it in parallel or cpu. I've looked at other posts pertaining to guvectorize, yet none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code 'cuda friendly', or is only doing multiplication across arrays too simple for any benefit to be seen?

Solution

First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host.

It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that might run faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is far too simple.

Secondly, it's important, I believe, to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will a worker do?" and then allow numba to spin up multiple workers from that. The way that you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data, but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here also.
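
As a small, self-contained illustration of that principle (separate from the code in this question, and with an illustrative function name): the signature below describes what one worker does to a single 1-D slice of length n, and passing 2-D inputs causes numba to spin up one worker per row:

import numpy as np
from numba import guvectorize

# each worker receives one 1-D slice of length n from each input
@guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->(n)')
def mult_rows(a, b, out):
    for i in range(a.shape[0]):
        out[i] = a[i] * b[i]

a = np.arange(12.).reshape(4, 3)   # 4 rows of length 3 -> numba spins up 4 workers
b = np.full((4, 3), 2.)
print(mult_rows(a, b))             # elementwise product, computed slice by slice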

So a problem we have in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker. As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is actually just a numba minimum threadblock size, and in fact only 1 thread in that threadblock is doing any useful calculation work. That one thread is executing your entire ufunc, including all for-loops, in a serial fashion.

That's obviously not what anyone would intend for rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a 3rd array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code, run with the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of your multBy array to match cv and discount:

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a runtime of about 0.375 seconds. But there is python overhead in both of these numbers. If we look at the generated CUDA kernel duration in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the computation kernel time is now much, much smaller than the time it takes to transfer the 3 arrays to/from the GPU (which takes about 20 milliseconds total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) in action with your original code.

Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it is dimensionally matching the other two arrays, and be done with it.
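
One possible way to do that replication (not the only one) is with numpy broadcasting, which expands multBy to the full shape without an explicit loop; the variable name below is just for illustration:

import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# broadcast the (4000, 5) array up to (100, 4000, 5); ascontiguousarray
# materializes the broadcast view into a real array that can be passed
# to the vectorize ufunc above
multBy3d = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))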

But the question remains: how to handle dissimilar input array sizes, as in the original problem? For that I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).
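
For completeness, a hand-written @cuda.jit kernel for this element-wise operation might look something like the sketch below (launch dimensions chosen arbitrarily, and the kernel name is just illustrative; this is a sketch, not tuned code):

import numpy as np
from numba import cuda

@cuda.jit
def cVestDiscount_kernel(multBy, discount, cv):
    # one thread per element of the 3-D output
    i, j, k = cuda.grid(3)
    if i < cv.shape[0] and j < cv.shape[1] and k < cv.shape[2]:
        cv[i, j, k] = multBy[j, k] * discount[i, j, k]

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))
cv = np.zeros_like(discount)

threads = (4, 16, 4)   # 256 threads per block, arbitrary shape
blocks = tuple((cv.shape[d] + threads[d] - 1) // threads[d] for d in range(3))
cVestDiscount_kernel[blocks, threads](multBy, discount, cv)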

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give the possibility for multiple workers to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise if we slice along the first, or countRow dimension, we would have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see below for a worked example of how you might approach the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from our guvectorize kernel, and allow the ufunc system to parallelize along that dimension (based on sizes of arrays we pass). The code could look something like this:

$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

  1. The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified the 3-dimensional array types in the signature to two-dimensional. We are taking a two-dimensional "slice" of the 3-dimensional data, after all, and letting each worker work on a slice.

  2. The kernel dimensions as reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization, there was really only 1 "slice" presented, and therefore 1 worker needed, and therefore 1 thread (but numba spun up 1 threadblock of 64 threads). In this realization, there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads, to provide the needed 100 workers/threads.

  3. The kernel performance reported by the profiler of 47.4ms is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (at ~0.001s). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version, performance-wise (see below). Note that the difference between where we are at here (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms, or less) of transferring a slightly larger multBy matrix to the device, to facilitate the vectorize simplicity.

Some additional comments on the python timing: I believe the exact behavior of how python is compiling the necessary kernels for the original, vectorize, and guvectorize improved versions is different. If we modify the t15.py code to run a "warm-up" run, then at least the python timing is consistent, trend-wise with the overall wall time and the kernel-only timing:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, responding to a question in the comments, effectively: "How would I recast the problem to slice along the 4000 (countCol, or "middle") dimension?"

We can be guided by what worked to slice along the first dimension. One possible approach would be to rearrange the shape of the arrays so that the 4000 dimension was the first dimension, then remove that, similar to what we did in the previous treatment of guvectorize. Here's a worked example:

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Somewhat predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we slice into 4000 workers. Similarly, we observe that numba is choosing to spin up 63 blocks of 64 threads each for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".
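
One caveat: t17.py simply generates discount directly in the (4000,100,5) shape. If you were starting from data actually laid out as (100,4000,5), the rearrangement would be an axis move rather than a reshape; something like the following (variable names are just for illustration):

import numpy as np

discount_orig = np.float64(np.arange(2000000).reshape(100, 4000, 5))  # original layout
# move the 4000 axis to the front and make it contiguous for the gufunc
discount_t = np.ascontiguousarray(discount_orig.transpose(1, 0, 2))   # shape (4000, 100, 5)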

Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.

As a final observation, let's revisit the statement I made earlier (and is similar to statements made in the comment and in the other answer):

"I doubt it would be possible to exceed the performance of a well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."

We now have convenient test cases in either t16.py or t17.py that we could work with to test this. For simplicity I'll choose t16.py. We can "convert this back to a CPU code" simply by removing the target designation from the guvectorize ufunc:

$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

So we see that this CPU-only version runs the function in about 6 milliseconds, and it has no GPU "overhead" such as CUDA initialization, and copy of data to/from GPU. The overall walltime is also our best measurement, at about 0.5s compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well-suited to GPU computation.
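
To put a rough number on that last point (a back-of-the-envelope estimate using the sizes from these examples): the operation performs one multiply per output element while moving two float64 inputs down to the device and one float64 output back:

flops = 100 * 4000 * 5                                # one multiply per output element
bytes_moved = (2_000_000 + 20_000 + 2_000_000) * 8    # discount in, multBy in, cv out
print(flops / bytes_moved)                            # roughly 0.06 flop per byte transferred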
