How do warps work with atomic operations?


Question

The threads in a warp run physically in parallel, so if one of them (call it thread X) starts an atomic operation, what do the others do? Wait? Does that mean all threads wait while thread X is pushed onto the atomic queue, acquires access (the mutex), does some work on the memory protected by that mutex, and then releases the mutex?

Is there any way to give the other threads some work, such as reading some memory, so that the atomic operation hides its latency? I mean, 15 idle threads is... not great, I guess. Are atomics really slow? How can I speed them up? Is there any pattern for working with them?

Does an atomic operation on shared memory lock a single bank or the whole memory? For example (without mutexes), given __shared__ float smem[256];

  • Thread1 runs atomicAdd(smem, 1);
  • Thread2 runs atomicAdd(smem + 1, 1);

These threads work with different banks, but with the same shared memory overall. Do they run in parallel, or will they be queued? Does it make any difference in this example whether Thread1 and Thread2 come from separate warps or the same one?
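The scenario being asked about, as a minimal kernel (my own sketch, not from the original question; note that __shared__ arrays must be declared inside the kernel):

```cuda
// Two threads issue atomics to adjacent 32-bit words of shared memory,
// which fall in different banks.
__global__ void adjacentAtomics()
{
    __shared__ float smem[256];
    if (threadIdx.x < 256)
        smem[threadIdx.x] = 0.0f;
    __syncthreads();

    if (threadIdx.x == 0) atomicAdd(smem, 1.0f);      // "Thread1"
    if (threadIdx.x == 1) atomicAdd(smem + 1, 1.0f);  // "Thread2"
}
```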

Answer

I count something like 10 questions here, which makes this quite difficult to answer. It's suggested that you ask one question per question.

Generally speaking, all threads in a warp execute the same instruction stream, so there are two cases to consider:

  1. Without conditionals (e.g. if...then...else). In this case, all threads execute the same instruction, which happens to be an atomic instruction. All 32 threads will then execute an atomic, although not necessarily on the same location. All of these atomics get processed by the SM and will serialize to some extent (they serialize completely if they update the same location).
  2. With conditionals. For example, suppose we had if (!threadIdx.x) atomicAdd(data, 1);. Thread 0 would execute the atomic and the others would not. It might seem that we could have the others do something else, but lockstep warp execution doesn't allow this. Warp execution is serialized such that all threads taking the if (true) path execute together and all threads taking the if (false) path execute together, but the true and false paths are serialized. So, again, we can't really have different threads in a warp executing different instructions simultaneously.

The net of it is that, within a warp, we can't have one thread execute an atomic while the others simultaneously do something else.
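The conditional case above can be made concrete with a small kernel (my own sketch; the kernel name is hypothetical):

```cuda
// Sketch: one lane per warp performs the atomic while the remaining
// lanes are masked off. The masked lanes cannot be assigned other
// work during the atomic; they simply sit out that path.
__global__ void oneAtomicPerWarp(int *data)
{
    // Only lane 0 of each warp takes the "true" path.
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(data, 1);   // lanes 1..31 are idle for this path
    }
    // The warp reconverges here and continues in lockstep.
}
```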

A number of your other questions seem to assume that memory transactions complete by the end of the instruction cycle in which they originate. This is not the case. With both global and shared memory, we must take special steps in the code to ensure that previous write transactions are visible to other threads (which could be taken as evidence that the transaction completed). One typical way to do this is to use barrier instructions, such as __syncthreads() or __threadfence(). Without those barrier instructions, threads are not "waiting" for writes to complete. A read (or an operation dependent on a read) can stall a thread; a write generally cannot stall a thread.
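This visibility rule can be sketched as follows (a minimal illustration of my own, assuming the kernel is launched with 256 threads per block):

```cuda
// Without __syncthreads(), the read below could run before the
// neighboring thread's write has been made visible to the block.
__global__ void neighborCopy(int *out)
{
    __shared__ int smem[256];
    smem[threadIdx.x] = threadIdx.x;   // write: does not stall this thread
    __syncthreads();                   // make all writes visible to the block
    out[threadIdx.x] = smem[(threadIdx.x + 1) % 256];  // read another thread's write
}
```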

Now let's look at your questions:

so if one of them starts an atomic operation, what do the others do? Wait?

No, they don't wait. The atomic operation is dispatched to a functional unit on the SM that handles atomics, and all threads continue together in lockstep. Since an atomic generally implies a read, yes, the read can stall the warp. But the threads do not wait until the atomic operation (i.e., the write) completes. However, a subsequent read of that location could stall the warp, again waiting for the atomic (write) to complete. In the case of a global atomic, which is guaranteed to update global memory, it will invalidate the L1 in the originating SM (if enabled) and the L2, if they contain that location as an entry.
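The distinction between the atomic's read and its write can be sketched like this (my own illustration, not from the original answer):

```cuda
// atomicAdd returns the value that was at *data before the add.
__global__ void atomicReadWrite(int *data, int *out)
{
    // Consuming the return value creates a dependency that can stall
    // the warp; a "fire and forget" atomicAdd whose return value is
    // ignored does not force the warp to wait for the write to land.
    int old = atomicAdd(data, 1);
    out[threadIdx.x] = old;   // this use is what makes the read matter
}
```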

Is there any way to give the other threads some work, such as reading some memory, so the atomic operation hides its latency?

Not really, for the reasons stated at the beginning.

Are atomics really slow? How can I speed them up? Is there any pattern for working with them?

Yes, atomics can make a program run much more slowly if they dominate the activity (such as in naive reductions or naive histogramming). Generally speaking, the way to accelerate atomic operations is not to use them, or to use them sparingly, in a way that doesn't dominate program activity. For example, a naive reduction would use an atomic to add every element to the global sum. A smart parallel reduction will use no atomics at all for the work done within the threadblock; at the end of the threadblock reduction, a single atomic might be used to update the threadblock's partial sum into the global sum. This means I can do a fast parallel reduction of an arbitrarily large number of elements with perhaps on the order of 32 atomic adds, or fewer. This sparing use of atomics will be essentially unnoticeable in overall program execution, except that it enables the parallel reduction to be done in a single kernel call rather than two.
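The reduction pattern described here might look like the following (a sketch of my own, assuming a 256-thread, power-of-two block size and *globalSum initialized to zero before launch):

```cuda
__global__ void reduceSum(const float *in, float *globalSum, int n)
{
    __shared__ float smem[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block: no atomics at all here.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }

    // A single atomic per block folds the partial sum into the total.
    if (tid == 0)
        atomicAdd(globalSum, smem[0]);
}
```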

Shared memory: do they run in parallel, or will they be queued?

They will be queued. The reason is that there are a limited number of functional units that can process atomic operations on shared memory, not enough to service all the requests from a warp in a single cycle.

I've avoided trying to answer questions about the throughput of atomic operations, because AFAIK this data is not well specified in the documentation. It may be that if you issue enough simultaneous or nearly simultaneous atomic operations, some warps will stall on the atomic instruction because the queues feeding the atomic functional units are full. I don't know this to be true and can't answer questions about it.
