Multi-GPU implementation of a 2 million particle calculation


Problem description

I want to increase the number of particles in my calculation. So far I have been able to simulate 1 million particles on a single GPU. Is it possible to increase the calculation to 2 million particles using multiple GPUs?

Recommended answer

If you can parallelize an algorithm and map it onto one GPU, then mapping it onto multiple GPUs is a similar and not much harder task.

You can do this simply by allocating half of the work to GPU 1 and the other half to GPU 2. You just need to use streams so that the two GPUs' working timelines overlap. This way, if the GPUs are equally performant, you can cut the total compute time by about 50%.
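As a rough illustration, such a two-GPU split might look like the sketch below. This is a minimal, hypothetical skeleton, not code from the original answer: the kernel name `updateParticles` and its empty body are placeholders for a real particle update.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: each thread would update one particle of its half.
__global__ void updateParticles(float4* pos, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        // ... compute forces and integrate particle i ...
    }
}

int main() {
    const int n = 2000000;   // total particles
    const int half = n / 2;  // per-GPU share (even split assumed)

    float4* d_pos[2];
    cudaStream_t stream[2];

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaStreamCreate(&stream[dev]);
        cudaMalloc(&d_pos[dev], half * sizeof(float4));
        // ... copy this GPU's half of the particle data in ...
    }

    // Launches are asynchronous, so both GPUs compute at the same time.
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        int blocks = (half + 255) / 256;
        updateParticles<<<blocks, 256, 0, stream[dev]>>>(d_pos[dev], half);
    }

    // Wait for both timelines to finish.
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
    }
    return 0;
}
```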

If the GPUs are different, then you will also need to decide which GPU gets what percentage of the work, but that problem is not too hard to solve.
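One simple way to pick the percentages is to split in proportion to measured speed. The sketch below assumes a hypothetical helper `timeOneStep(device)` that runs one warm-up step on the given GPU and returns the elapsed milliseconds:

```cuda
// Hypothetical load balancing: give each GPU work proportional to its speed.
float ms0 = timeOneStep(0);  // warm-up step on GPU 0
float ms1 = timeOneStep(1);  // warm-up step on GPU 1
float speed0 = 1.0f / ms0;
float speed1 = 1.0f / ms1;
int n0 = (int)(n * speed0 / (speed0 + speed1));  // particles for GPU 0
int n1 = n - n0;                                 // the rest go to GPU 1
```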

Depending on the divide-and-conquer scheme you choose, the kernels on the GPUs may look different or identical. The input data could be the same while the outputs differ. Memory visibility also becomes easier to manage with unified memory.
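For instance, with unified memory a single managed buffer is visible to both GPUs, so the explicit per-device copies go away. A minimal sketch, reusing the hypothetical `updateParticles` kernel and `n`/`half` names from above (note that on pre-Pascal GPUs such as cc3.0 the driver may fall back to slower access modes for managed memory, so performance should be measured):

```cuda
float4* pos = nullptr;
cudaMallocManaged(&pos, n * sizeof(float4));  // one buffer, visible to both GPUs

cudaSetDevice(0);
updateParticles<<<(half + 255) / 256, 256>>>(pos, half);             // first half

cudaSetDevice(1);
updateParticles<<<(half + 255) / 256, 256>>>(pos + half, n - half);  // second half

// Synchronize both devices before touching the data on the host.
for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
}
```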

For example, I wrote a brute-force n-body kernel (for 64k particles) to run on two Quadro K420 (cc3.0) GPUs and got 405 GFLOPS (60% of peak) out of them.
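The answer doesn't include that kernel, but a typical shared-memory-tiled brute-force O(n²) gravity kernel, in the style of the classic GPU Gems 3 n-body example, looks roughly like this (an illustrative sketch, not the author's code):

```cuda
// Each thread accumulates the acceleration of one particle from all others;
// shared-memory tiling keeps the inner loop fed from fast on-chip memory.
__global__ void nbodyBruteForce(const float4* pos, float4* acc, int n) {
    extern __shared__ float4 tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0, 0, 0, 0);
    float3 a = make_float3(0, 0, 0);

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0, 0, 0, 0);
        __syncthreads();
        for (int k = 0; k < blockDim.x && base + k < n; ++k) {
            float3 r = make_float3(tile[k].x - pi.x, tile[k].y - pi.y,
                                   tile[k].z - pi.z);
            float d2  = r.x * r.x + r.y * r.y + r.z * r.z + 1e-6f;  // softening
            float inv = rsqrtf(d2);
            float s   = tile[k].w * inv * inv * inv;  // m_j / d^3
            a.x += r.x * s; a.y += r.y * s; a.z += r.z * s;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = make_float4(a.x, a.y, a.z, 0);
}
```

It would be launched with `blockDim.x * sizeof(float4)` bytes of dynamic shared memory.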

Things that can help you:

- cooperative kernels (see the sketch after this list)
- unified memory
- explicit device selection by streams on kernels and buffer copies
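On newer hardware, cooperative kernels let a force pass and an integration pass live in one kernel with a grid-wide sync between them, instead of two separate launches. A rough sketch, assuming a GPU and driver that support cooperative launch (cc3.0 Kepler does not) and compilation with `-rdc=true`:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// One kernel per timestep: grid.sync() replaces the kernel-boundary
// synchronization between the force pass and the integration pass.
__global__ void stepParticles(float4* pos, float4* vel, int n, float dt) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Pass 1: compute the acceleration of particle i (omitted).

    grid.sync();  // every force is finished before anyone integrates

    // Pass 2: integrate (illustrative Euler step).
    if (i < n) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}

// Launched with cudaLaunchCooperativeKernel instead of <<<...>>>:
//   void* args[] = { &pos, &vel, &n, &dt };
//   cudaLaunchCooperativeKernel((void*)stepParticles, gridDim, blockDim,
//                               args, 0, stream);
```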

If you have a Kepler cc3.0 GPU like me, then you can try the old way: splitting kernels and buffers per device and managing them explicitly.
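A sketch of that old-school approach for an all-pairs force: each device keeps a full copy of the positions but integrates only its own half, and the fresh halves are swapped after every step. The buffer and stream names are assumptions carried over from the sketches above.

```cuda
// Each device holds the full n-particle position buffer but owns one half.
float4* d_pos[2];                        // full-size buffer per device
size_t halfBytes = half * sizeof(float4);

// Allow direct GPU-to-GPU copies where the hardware supports them
// (cudaMemcpyPeerAsync stages through the host otherwise).
cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

// ... each device runs its kernel, updating only its own half ...

// Exchange the freshly computed halves after every step.
cudaMemcpyPeerAsync(d_pos[1], 1, d_pos[0], 0, halfBytes, stream[0]);
cudaMemcpyPeerAsync(d_pos[0] + half, 0, d_pos[1] + half, 1,
                    halfBytes, stream[1]);
```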

What kind of particles are these? Fluid particles with short-range forces? Gravitationally interacting particles with long-range forces? A totally different scenario? Does each "update" need hundreds of kernel calls with global data synchronization, or just a few kernels with a single data sync between all work items? For some algorithms, performance can't grow linearly with the number of GPUs, while others scale well. How much computation per byte does your kernel do? Does it use atomics?

