OpenCL and CUDA registers usage optimization


Question

I'm currently writing an OpenCL kernel (but I suppose it would be the same in CUDA), and at the moment I'm trying to optimize it for NVIDIA GPUs.

I currently use 63 registers in my kernel; the kernel is very big, so it uses all of the GPU's registers. I'm looking for a way to:

1) See which variables are in registers and which end up in global memory (because if I don't have enough registers, it seems the compiler spills variables to global memory).

2) Is there a way to specify which variable is more important (or which should stay in registers)? I use some variables that are present but used less often. Is there a way to give them a priority?

Is there any other optimization strategy when all the registers are already in use?

BTW: I have also tried reading the PTX code and searching for all the ".reg" keywords, but the problem is that the PTX is unreadable; I don't know which register is used for which variable in my code. I haven't found any way to get the correspondence!
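One way to at least see the total register and spill counts (not which variable ends up where) is the compiler's resource-usage report rather than raw PTX. With CUDA, compiling with nvcc -Xptxas -v prints a per-kernel summary ("Used NN registers", spill load/store bytes). With NVIDIA's OpenCL implementation, the vendor build option -cl-nv-verbose puts a similar report in the build log. The host-side sketch below assumes an already-created program and device; the helper name is made up for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Hypothetical helper: build with NVIDIA's verbose option and print the
 * build log, which then contains lines such as "Used 63 registers" and
 * spill load/store byte counts. */
static void print_register_report(cl_program program, cl_device_id device)
{
    /* "-cl-nv-verbose" is an NVIDIA-specific build option. */
    clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);

    char *log = (char *)malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);
    printf("%s\n", log);
    free(log);
}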

Thanks

Answer

See which variables are in registers and which are then in global memory

For this, I don't know of a way to check it. However,

Is there a way to specify which variable is more important

One trick I use when I see that I have spilled registers (due to a lack of them, or when I need dynamic indexing into local variables, which is bad) is to explicitly store the ones that I think are not so critical into local memory (called "shared" memory in CUDA).

E.g. before:

uint16 somedata;

After:

__local uint16 somedata[WG_SIZE]; // or __local uint someadata[16];

But beware: if your local memory usage increases greatly, you risk a penalty because the number of in-flight wavefronts will be lower (i.e. you might have lower occupancy).
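To make the trick concrete, here is a minimal kernel sketch that moves one rarely-used value out of private registers into local (shared) memory; the kernel name, arguments and the fixed work-group size of 64 are assumptions for illustration, and each work-item indexes its own slot with get_local_id:

#define WG_SIZE 64   /* assumed fixed work-group size */

__kernel void example_kernel(__global const float *in, __global float *out)
{
    /* Instead of "float rarely_used;" (a private register), keep one
       slot per work-item in local/shared memory. */
    __local float rarely_used[WG_SIZE];

    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    rarely_used[lid] = in[gid] * 0.5f;   /* written once */

    /* ... register-hungry main computation goes here ... */

    out[gid] = rarely_used[lid] + 1.0f;  /* read back much later */
}

No barrier is needed here because each work-item only touches its own slot; the cost is the extra local-memory footprint mentioned above.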

Hope this helps.
