How to efficiently gather data from threads in CUDA?


Question

I have an application that solves a system of equations in CUDA. I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host?

I'm passing a huge array with enough space for every thread to store 4 solutions (4 doubles per solution), and another array holding the number of solutions found by each thread. However, that's a naive solution, and it is the current bottleneck of my kernel.

I would really like to optimize this. The main problem is concatenating a variable number of solutions per thread into a single array.

Recommended answer

This is called stream compaction.

You probably do need to provide an array that contains room for 4 solutions per thread. Attempting to store the results directly in a compact form is likely to create so many dependencies between the threads that the performance gained by copying less data back to the host is lost to a longer kernel execution time.

The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
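The atomic-index idea can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the `Solution` struct, the placeholder `solve()` function (which pretends every 100th thread finds one solution), the kernel name, and the host helper `gather_with_atomics()` are all assumptions made up for the example. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed record layout: 4 doubles per solution, as described in the question.
struct Solution { double v[4]; };

// Hypothetical stand-in for the real solver: every 100th thread "finds"
// exactly one solution, all other threads find none.
__device__ int solve(int tid, Solution *found)
{
    if (tid % 100 != 0) return 0;
    for (int k = 0; k < 4; ++k) found[0].v[k] = tid + 0.25 * k;
    return 1;
}

__global__ void solve_and_append(Solution *out, unsigned int *count,
                                 unsigned int capacity, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Solution found[4];                 // at most 4 solutions per thread
    int num_found = solve(tid, found);

    for (int i = 0; i < num_found; ++i) {
        // atomicAdd returns the value *before* the increment, so each
        // solution gets a unique slot in the shared output array.
        unsigned int slot = atomicAdd(count, 1u);
        if (slot < capacity) out[slot] = found[i];
    }
}

// Host helper (hypothetical name): runs the kernel and copies back only
// the solutions that were actually found.
std::vector<Solution> gather_with_atomics(int n)
{
    const unsigned int capacity = 4u * n;   // worst case: 4 per thread
    Solution *d_out = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_out, capacity * sizeof(Solution));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    solve_and_append<<<blocks, threads>>>(d_out, d_count, capacity, n);

    // A blocking cudaMemcpy on the default stream waits for the kernel.
    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    if (h_count > capacity) h_count = capacity;

    std::vector<Solution> h_out(h_count);
    if (h_count > 0)
        cudaMemcpy(h_out.data(), d_out, h_count * sizeof(Solution),
                   cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_count);
    return h_out;
}
```

Note that the order of the solutions in the output depends on which thread reaches atomicAdd() first, so it can differ from run to run. If the originating thread matters, one option is to store its index alongside each solution.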

However, in the more common situation where there's a fair number of results, the best solution will be to perform the compaction as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
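A separate compaction pass with thrust::copy_if might look like the sketch below. It assumes the kernel writes into a dense array with 4 slots per thread and marks each slot as used or unused in a parallel flag array, which is then passed to copy_if as the stencil. The `Solution` struct, the placeholder kernel `solve_dense` (where thread `tid` pretends to find `tid % 3` solutions), the `IsValid` functor, and the helper `gather_with_copy_if()` are illustrative assumptions, not part of the original answer.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Assumed record layout: 4 doubles per solution, as described in the question.
struct Solution { double v[4]; };

// Predicate applied to the stencil (flag) array.
struct IsValid {
    __host__ __device__ bool operator()(int flag) const { return flag != 0; }
};

// Hypothetical stand-in for the real solver kernel: thread tid "finds"
// tid % 3 solutions and writes them into its own 4 slots.
__global__ void solve_dense(Solution *slots, int *valid, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int num_found = tid % 3;
    for (int i = 0; i < 4; ++i) {
        int slot = 4 * tid + i;
        valid[slot] = (i < num_found) ? 1 : 0;   // mark used / unused
        if (i < num_found)
            for (int k = 0; k < 4; ++k) slots[slot].v[k] = tid;
    }
}

// Host helper (hypothetical name): solve, compact on the device, then
// copy only the compacted solutions back to the host.
thrust::host_vector<Solution> gather_with_copy_if(int n)
{
    thrust::device_vector<Solution> slots(4 * n);
    thrust::device_vector<int> valid(4 * n);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    solve_dense<<<blocks, threads>>>(thrust::raw_pointer_cast(slots.data()),
                                     thrust::raw_pointer_cast(valid.data()), n);

    // Stencil form of copy_if: slots[i] is copied when IsValid(valid[i]) is
    // true. The relative order of the kept elements is preserved.
    thrust::device_vector<Solution> compact(4 * n);
    auto end = thrust::copy_if(slots.begin(), slots.end(), valid.begin(),
                               compact.begin(), IsValid());
    compact.resize(end - compact.begin());

    return thrust::host_vector<Solution>(compact);  // device-to-host copy
}
```

Unlike the atomic approach, copy_if keeps the surviving elements in their original order, so the output is deterministic. The price is an extra pass over the dense array on the device.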
