Calculating performance of CUFFT
Question
I am running CUFFT on chunks (N*N/p) divided across multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:
1. Send N*N/p chunks to each GPU
2. Batched 1-D FFT for each row in p GPUs
3. Get N*N/p chunks back to host, then perform a transpose on the entire dataset
4. Ditto Step 1
5. Ditto Step 2
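The steps above amount to the usual row-column decomposition of a 2-D FFT. As a minimal sketch of that data flow, the plain-Python code below uses a naive DFT in place of the batched cuFFT call and ordinary lists in place of device buffers. The function names (`dft`, `batched_row_fft`, `fft2_row_column`, `dft2_direct`) are hypothetical and chosen for illustration only, and the final transpose is an addition that restores the original orientation so the result can be compared with a direct 2-D DFT.

```python
import cmath


def dft(row):
    """Naive O(n^2) 1-D DFT; a stand-in for one transform in a cuFFT batch."""
    n = len(row)
    return [
        sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def batched_row_fft(matrix, p):
    """Split the rows into p equal chunks (one per hypothetical GPU) and
    transform every row of every chunk."""
    n_rows = len(matrix)
    assert n_rows % p == 0, "this sketch assumes p divides the number of rows"
    rows_per_chunk = n_rows // p
    out = []
    for g in range(p):
        chunk = matrix[g * rows_per_chunk:(g + 1) * rows_per_chunk]  # "send" to GPU g
        out.extend(dft(row) for row in chunk)                        # batched 1-D FFTs
    return out                                                       # "copy back" to host


def transpose(matrix):
    """Host-side transpose of the whole dataset."""
    return [list(col) for col in zip(*matrix)]


def fft2_row_column(matrix, p):
    """Row FFTs, transpose, row FFTs again, then transpose back."""
    rows_done = batched_row_fft(matrix, p)
    cols_done = batched_row_fft(transpose(rows_done), p)
    return transpose(cols_done)


def dft2_direct(matrix):
    """Direct 2-D DFT of a square matrix, used only as a reference."""
    n = len(matrix)
    return [
        [
            sum(
                matrix[a][b] * cmath.exp(-2j * cmath.pi * (a * u + b * v) / n)
                for a in range(n)
                for b in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]


if __name__ == "__main__":
    data = [[complex(r * 4 + c, 0) for c in range(4)] for r in range(4)]
    result = fft2_row_column(data, p=2)
    reference = dft2_direct(data)
    worst = max(abs(result[u][v] - reference[u][v]) for u in range(4) for v in range(4))
    print("largest deviation from direct 2-D DFT:", worst)
```

The sketch only demonstrates that the chunked row pass, host transpose, and second row pass together compute the full 2-D transform. It says nothing about timing.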
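I then evaluate the performance as:
Gflops = (1e-9 * 5 * N * N * lg(N*N)) / execution time
where the execution time is calculated as:
execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)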
Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of FFT?
Thanks.
Recommended answer
If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is
operation count / wall clock time
In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e., how long the whole operation took) and use it as the execution time, or use this:
execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
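To make the difference between the two formulas concrete, here is a minimal Python sketch. The function names and the per-GPU timings are hypothetical; each entry of `per_gpu_seconds` is assumed to be the total memcpyHtoD + kernel + memcpyDtoH time for the row and column passes on one GPU. Like the formula above, it leaves out the host-side transpose, which a true wall clock measurement would include.

```python
import math


def fft_flops(n_points, real_input=False):
    """Nominal FFT operation count: 5 n log2(n) for a complex transform,
    2.5 n log2(n) for a real-valued one."""
    factor = 2.5 if real_input else 5.0
    return factor * n_points * math.log2(n_points)


def gflops_parallel(n, per_gpu_seconds, real_input=False):
    """Throughput for an N x N transform when the GPUs run concurrently:
    the slowest GPU determines the elapsed time."""
    elapsed = max(per_gpu_seconds)
    return 1e-9 * fft_flops(n * n, real_input) / elapsed


def gflops_serial(n, per_gpu_seconds, real_input=False):
    """The figure obtained by summing the per-GPU times, i.e. treating the
    GPUs as if they ran one after another."""
    elapsed = sum(per_gpu_seconds)
    return 1e-9 * fft_flops(n * n, real_input) / elapsed


if __name__ == "__main__":
    timings = [0.5, 0.5]  # hypothetical per-GPU totals in seconds
    print("max-based GFLOPS:", gflops_parallel(1024, timings))
    print("sum-based GFLOPS:", gflops_serial(1024, timings))
```

With p GPUs taking roughly equal time, the sum-based figure comes out about p times lower than the max-based one.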
As it stands, your calculation represents the serial execution time. Allowing for the overheads of the multi-GPU scheme, I would expect the calculated performance numbers you are getting to be lower than those for the equivalent transform done on a single GPU.