x64 allows fewer threads per block than Win32?


Question

While running CUDA kernels, I noticed that many of my own kernels fail in an x64 build but run without problems in a Win32 build.

I am confused, because the CUDA source code is identical and both configurations build cleanly. The failure only appears when the x64 binary executes: the launch reports that it requests too many resources. Conceptually, shouldn't x64 allow more resources than Win32, not fewer?

I normally like to use 1024 threads per block when possible. To make the x64 build work, I have to shrink the block to 256 threads.

Does anyone have any ideas?

Recommended answer

Yes, it's possible. Presumably the issue you are describing is a registers-per-thread issue.

In 32-bit mode, all pointers are 32 bits wide and each needs only one 32-bit register on the GPU. With exactly the same source code compiled in 64-bit mode, those pointers need 64 bits of storage and therefore effectively occupy two 32-bit registers each (and, as @njuffa points out below, certain other types can change size as well, doubling their register requirement). The number of available 32-bit registers is a hardware limit that does not change between 32-bit and 64-bit compilation, but pointer storage uses twice as many registers in 64-bit mode.

Pointer arithmetic (or arithmetic involving any of the types that grow in size) may likewise be affected, since some of it may have to be done with 64-bit rather than 32-bit operations.
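One way to check whether register pressure is really the cause is to ask the CUDA runtime how many registers the compiled kernel uses and compare that against the per-block register limit. The following is a minimal sketch, not the asker's code: `scaleKernel` is a hypothetical stand-in kernel, device 0 is assumed, and the product `numRegs * threadsPerBlock` is only an estimate because the hardware allocates registers in coarser units.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel. Its two pointer parameters are 4 bytes each
// in a 32-bit build and 8 bytes each in a 64-bit build.
__global__ void scaleKernel(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * factor;
    }
}

int main()
{
    const int threadsPerBlock = 1024;  // block size the question wants to use

    // Per-kernel resource usage as compiled for the current device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Hardware limits of device 0 (assumed to be the device in use).
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host pointer size:            %zu bytes\n", sizeof(void*));
    printf("registers per thread:         %d\n", attr.numRegs);
    printf("register limit per block:     %d\n", prop.regsPerBlock);
    printf("max block size for kernel:    %d\n", attr.maxThreadsPerBlock);

    // Rough estimate only: real allocation is rounded up per warp.
    long long needed = (long long)attr.numRegs * threadsPerBlock;
    if (needed > prop.regsPerBlock || threadsPerBlock > attr.maxThreadsPerBlock) {
        printf("%d threads per block likely exceeds the register budget\n",
               threadsPerBlock);
    } else {
        printf("%d threads per block should fit\n", threadsPerBlock);
    }
    return 0;
}
```

Building the same file once as Win32 and once as x64 and comparing the reported `numRegs` values would show directly how much the 64-bit build adds per thread.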

If the increase in registers per thread in 64-bit mode pushes your overall usage over the limit, you will have to manage it with one of several methods. You have already mentioned one: reduce the number of threads per block. You can also investigate the nvcc -maxrregcount ... switch and/or the launch bounds directive.
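As an illustration of the last two options, the sketch below marks a kernel with `__launch_bounds__(1024)`, which tells the compiler the kernel will be launched with up to 1024 threads per block so that it limits register use accordingly. The kernel name, sizes, and the register cap in the build comments are hypothetical values chosen for the example, not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. __launch_bounds__(1024) asks the compiler to keep
// per-thread register use low enough for a 1024-thread block to launch.
__global__ void __launch_bounds__(1024)
scaleKernelBounded(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * factor;
    }
}

int main()
{
    const int n = 1 << 20;              // illustrative problem size
    const int threadsPerBlock = 1024;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const size_t bytes = n * sizeof(float);

    float* in = nullptr;
    float* out = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess ||
        cudaMalloc(&out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, bytes);

    scaleKernelBounded<<<blocks, threadsPerBlock>>>(in, out, 2.0f, n);

    // A launch that exceeds the register budget is reported here.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    printf("launch status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}

// Build notes (file name and register cap are illustrative):
//   nvcc -Xptxas -v bounded.cu -o bounded      prints registers used per kernel
//   nvcc -maxrregcount=32 bounded.cu -o bounded caps registers for every kernel
//                                               in the file
```

Both approaches trade registers for something else: when the compiler is forced below the register count it would otherwise choose, it may spill values to local memory, which can slow the kernel down. `__launch_bounds__` applies per kernel, while `-maxrregcount` applies to the whole compilation unit.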
