Calculating performance of CUFFT

Question

I am running CUFFT on chunks (N*N/p) divided across multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:


  1. Send N*N/p chunks to each GPU
  2. Batched 1-D FFT for each row in p GPUs
  3. Get N*N/p chunks back to host - perform transpose on the entire dataset
  4. Ditto Step 1
  5. Ditto Step 2
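
The steps above amount to a row-column decomposition of a 2-D FFT. As a rough sketch only, the following pure-Python model stands in for the GPU work: the naive `dft` function replaces the batched cuFFT call, the "GPUs" are just list chunks, and all function names are made up for illustration. It also assumes p divides N and adds a final transpose back to the usual layout, which the steps above do not mention.

```python
import cmath

def dft(row):
    """Naive O(n^2) 1-D DFT; a stand-in for one transform in a batched FFT."""
    n = len(row)
    return [sum(row[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(m):
    """Host-side transpose of a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def split_rows(m, p):
    """Split the rows into p contiguous chunks (assumes p divides len(m))."""
    step = len(m) // p
    return [m[i * step:(i + 1) * step] for i in range(p)]

def row_pass(m, p):
    """Steps 1-2: give each simulated GPU one chunk and transform every row in it."""
    out = []
    for chunk in split_rows(m, p):
        out.extend(dft(row) for row in chunk)
    return out

def fft2_by_rows(m, p):
    """Steps 1-5: row pass, transpose on the host, second row pass."""
    first = row_pass(m, p)
    second = row_pass(transpose(first), p)
    return transpose(second)  # extra transpose back to [u][v] order (an assumption)

def dft2_direct(m):
    """Direct 2-D DFT from the definition, used only as a reference result."""
    n = len(m)
    return [[sum(m[a][b] * cmath.exp(-2j * cmath.pi * (a * u + b * v) / n)
                 for a in range(n) for b in range(n))
             for v in range(n)] for u in range(n)]

if __name__ == "__main__":
    data = [[complex(4 * i + j, 0) for j in range(4)] for i in range(4)]
    by_rows = fft2_by_rows(data, p=2)
    direct = dft2_direct(data)
    worst = max(abs(by_rows[i][j] - direct[i][j]) for i in range(4) for j in range(4))
    print("largest difference from the direct 2-D DFT:", worst)
```

The point of the comparison is only that two batched row passes with a transpose in between give the same result as a full 2-D transform, so the operation count of the 2-D transform is the right numerator for the performance figure.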

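I calculate the performance as:

Gflops = (1e-9 * 5 * N * N * lg(N * N)) / execution time

where the execution time is calculated as:

execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)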

Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of FFT?

Thanks.

Recommended answer

If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is

operation count / wall clock time
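
As a small worked example of the operation count, the sketch below uses the nominal 5 N log2(N) figure for a complex 1-D FFT and 2.5 N log2(N) for a real one, and checks that N row transforms plus N column transforms add up to the 5 N N lg(N N) used in the question. The function names, N = 4096 and the 0.05 s wall clock time are illustrative assumptions, not measurements.

```python
import math

def fft_flops(n, batch=1, real=False):
    """Nominal flop count for `batch` 1-D FFTs of length n:
    5*n*log2(n) per complex transform, 2.5*n*log2(n) per real transform."""
    per_point = 2.5 if real else 5.0
    return batch * per_point * n * math.log2(n)

def gflops(flops, wall_seconds):
    """Throughput as operation count / wall clock time, scaled to GFLOP/s."""
    return 1e-9 * flops / wall_seconds

N = 4096                                    # hypothetical problem size
row_and_col = 2 * fft_flops(N, batch=N)     # N row FFTs plus N column FFTs
whole_2d = 5.0 * N * N * math.log2(N * N)   # count used in the question

if __name__ == "__main__":
    print("row + column count:", row_and_col)
    print("2-D formula count: ", whole_2d)
    print("GFLOP/s at a hypothetical 0.05 s wall time:", gflops(whole_2d, 0.05))
```

The two counts agree because log2(N * N) = 2 log2(N), which is why the numerator in the question is fine for a complex transform.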

In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) and use that as the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
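
To make the difference between the sum and the max concrete, here is a minimal sketch with made-up timings for two GPUs, each doing a row pass and a column pass. All numbers and names are hypothetical, and the host-side transpose is left out, just as it is in the formula above; a measured wall clock time would include it.

```python
import math

def per_gpu_time(passes):
    """Total time one GPU spends on copies and kernels over all of its passes."""
    return sum(p["h2d"] + p["kernel"] + p["d2h"] for p in passes)

# Hypothetical timings in seconds: [row pass, column pass] for each GPU.
gpu_passes = [
    [{"h2d": 0.010, "kernel": 0.004, "d2h": 0.010},
     {"h2d": 0.010, "kernel": 0.004, "d2h": 0.010}],
    [{"h2d": 0.011, "kernel": 0.004, "d2h": 0.011},
     {"h2d": 0.011, "kernel": 0.004, "d2h": 0.011}],
]

N = 4096                                   # hypothetical problem size
flops = 5.0 * N * N * math.log2(N * N)     # complex 2-D transform

serial_time = sum(per_gpu_time(g) for g in gpu_passes)    # Sum(...) from the question
parallel_time = max(per_gpu_time(g) for g in gpu_passes)  # max(...) from the answer

gflops_serial = 1e-9 * flops / serial_time
gflops_parallel = 1e-9 * flops / parallel_time

if __name__ == "__main__":
    print("sum-based time:", serial_time, "-> GFLOP/s:", gflops_serial)
    print("max-based time:", parallel_time, "-> GFLOP/s:", gflops_parallel)
```

With the sum, adding GPUs never makes the reported number better, since every extra device adds its own time to the denominator; with the max, the slowest GPU sets the execution time, which is what an overlapped run would show on a wall clock.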

As it stands, your calculation represents the serial execution time. Allowing for the overheads from the multigpu scheme, I would expect that the calculated performance numbers you are getting will be lower than the equivalent transform done on a single GPU.
