Artificially downgrade CUDA compute capabilities to simulate other hardware


Question


I am developing software that should run on several CUDA GPUs with varying amounts of memory and compute capability. It has happened to me more than once that customers report a reproducible problem on their GPU that I couldn't reproduce on my machine. Maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because I have compute capability 3.0 and they have 2.0, things like that.


Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with a smaller amount of memory and/or a less advanced compute capability?


Per comments clarifying what I'm asking.


Suppose a customer reports a problem running on a GPU with compute capability C with M gigs of GPU memory and T threads per block. I have a better GPU on my machine, with higher compute capability, more memory, and more threads per block.


  1. Can I run my program on my GPU restricted to M gigs of GPU memory? The answer to this one seems to be "yes, just allocate (whatever mem you have) - M at startup and never use it; that would leave only M until your program exits."


  2. Can I reduce the size of the blocks on my GPU to no more than T threads for the duration of runtime?


  3. Can I reduce the compute capability of my GPU, as seen by my program, for the duration of runtime?

Answer


I originally wanted to make this a comment but it was getting far too big for that scope.


As @RobertCrovella mentioned there is no native way to do what you are asking for. That said, you can take the following measures to minimize the bugs you see on other architectures.


0) Try to get the output from cudaGetDeviceProperties from the CUDA GPUs you want to target. You could crowd source this from your users or the community.


1) To restrict memory, you can either implement a memory manager and manually keep track of the memory being used, or use cudaMemGetInfo to get a fairly close estimate. Note: the free-memory figure this function returns also reflects memory used by other applications.


2) Have a wrapper macro to launch the kernel where you can explicitly check if the number of blocks / threads fit in the current profile. i.e. Instead of launching

kernel<float><<<blocks, threads>>>(a, b, c);

You would do:

LAUNCH_KERNEL((kernel<float>), blocks, threads, a, b, c);


Where you can have the macro be defined like this:

/* do { } while (0) makes the macro expand to a single statement,
   so it stays safe inside unbraced if/else bodies */
#define LAUNCH_KERNEL(kernel, blocks, threads, ...)\
        do {\
            check_blocks(blocks);\
            check_threads(threads);\
            kernel<<<blocks, threads>>>(__VA_ARGS__);\
        } while (0)


3) Reducing the compute capability is not possible, but you can compile your code for various compute capabilities and make sure your kernels have backward-compatible code in them. If a certain part of your kernel errors out on an older compute capability, you can do something like this:

#if !defined(TEST_FALLBACK) && __CUDA_ARCH__ >= 300 // Or any other newer compute
// Implement using new fancy feature
#else
// Implement a fallback version
#endif


You can define TEST_FALLBACK whenever you want to test your fallback code and ensure your code works on older compute capabilities.
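Putting both parts together, a build might look like the following. This is a sketch; the file name kernels.cu and the particular -gencode targets are illustrative, chosen to cover the compute capabilities mentioned in the question:

```shell
# Embed code for several compute capabilities in one fat binary
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -o app kernels.cu

# Separate build that forces the fallback paths even on a newer GPU
nvcc -DTEST_FALLBACK -gencode arch=compute_30,code=sm_30 \
     -o app_fallback kernels.cu
```

Running app_fallback on your own card then exercises the same code paths a customer's older card would take.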

