What is the best way to implement a small lookup table in an OpenCL kernel?


Question

In my kernel it is necessary to make a large number of random accesses to a small lookup table (only 8 32-bit integers). Each kernel has a unique lookup table. Below is a simplified version of the kernel to illustrate how the lookup table is used.

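__kernel void some_kernel(
    __global uint* global_table,
    __global uint* X,
    __global uint* Y,
    uint n) { // n was undeclared in the original; passing it as an argument is an assumption

    size_t gsi = get_global_size(0);
    size_t gid = get_global_id(0);

    // 8 words of global_table are copied to LUT
    // (simplified: the real offset into global_table is not shown)
    __private uint LUT[8];
    for (int k = 0; k < 8; k++) {
        LUT[k] = global_table[k];
    }

    // Y is assigned a value from the lookup table based on the current value of X
    for (size_t i = 0; i < n; i++) {
        Y[i*gsi+gid] = LUT[X[i*gsi+gid]];
    }
}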

Because of the small size, I am getting the best performance by keeping the table in the __private memory space. However, because of the random nature in which the lookup table is accessed, there is still a large performance hit. With the lookup table code removed (replaced with a simple arithmetic operation, for example), although the kernel would provide the wrong answer, the performance improves by a factor of over 3.

Is there a better way? Have I overlooked some OpenCL feature that provides efficient random access for very small chunks of memory? Could there be an efficient solution using vector types?

[edit] Note that the maximum value of X is 7, but the maximum value of Y is as large as 2^32-1. In other words, all the bits of each lookup table entry are in use, so the table cannot be packed into a smaller representation.

Recommended answer

The fastest solution I can think of is to not use arrays in the first place: use individual variables instead and use some sort of access function to access them as if they were an array. IIRC (at least for the AMD compiler but I'm pretty sure this is true for NVidia as well): generally, arrays are always stored in memory, while scalars may be stored in registers. (But my mind is a little fuzzy on the matter — I might be wrong!)

Even if you need a giant switch statement:

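// arr0123 holds LUT[0..3] and arr4567 holds LUT[4..7]. They are passed in
// as values so the compiler is free to keep them in registers.
uint getLUT(uint4 arr0123, uint4 arr4567, int x) {
    switch (x) {
    case 0: return arr0123.s0;
    case 1: return arr0123.s1;
    case 2: return arr0123.s2;
    case 3: return arr0123.s3;
    case 4: return arr4567.s0;
    case 5: return arr4567.s1;
    case 6: return arr4567.s2;
    case 7: default: return arr4567.s3;
    }
}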

... you might still come out ahead in performance compared to a __private array, since the lookup is purely ALU-bound as long as the arr variables all fit in registers. (That assumes you have enough spare registers for them, of course.)
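If the switch itself turns out to be the bottleneck, one variation worth trying is a chain of selects driven by the three low bits of the index. This is only a sketch of that idea, not something I have benchmarked: `lut_select` and the `t0`..`t7` parameter names are made up for illustration, and it is written as plain C99 so it can be checked on the host. In a kernel the same body should work with `uint` in place of `uint32_t`. Whether it beats the switch depends entirely on the compiler and the target.

```c
#include <stdint.h>

/* Hypothetical helper: returns the entry chosen by the three low bits of x.
   Every ?: picks between two values that are already computed, so a compiler
   can lower each one to a select instruction instead of a branch. */
static inline uint32_t lut_select(uint32_t t0, uint32_t t1, uint32_t t2, uint32_t t3,
                                  uint32_t t4, uint32_t t5, uint32_t t6, uint32_t t7,
                                  uint32_t x)
{
    /* Bit 0 picks the entry within each adjacent pair. */
    uint32_t a = (x & 1u) ? t1 : t0;
    uint32_t b = (x & 1u) ? t3 : t2;
    uint32_t c = (x & 1u) ? t5 : t4;
    uint32_t d = (x & 1u) ? t7 : t6;

    /* Bit 1 picks the pair within each half of the table. */
    uint32_t lo = (x & 2u) ? b : a;
    uint32_t hi = (x & 2u) ? d : c;

    /* Bit 2 picks the half. */
    return (x & 4u) ? hi : lo;
}
```

That is seven selects per lookup regardless of the index, which trades a little extra ALU work for having no divergent branches.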

Note, some OpenCL targets don't even have private memory, and anything you declare there just goes to __global. Using register storage is an even bigger win there.

Of course, this register-based LUT approach is likely to be slower to initialize, since you will need at least two separate memory reads (one per uint4) to copy the LUT data from global memory.
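To make the initialization cost concrete, here is a rough host-side C sketch of the whole pattern for one work-item: two four-word reads up front, then lookups that only touch the two local values. Everything here is a stand-in, not OpenCL API: `u32x4` imitates `uint4`, `load4` imitates what `vload4(offset, p)` does in a kernel, and `remap`, `get_lut`, and `n` are illustrative names. The strided `i*gsi+gid` indexing from the question is left out to keep it short.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for OpenCL's uint4; the field names are illustrative. */
typedef struct { uint32_t s0, s1, s2, s3; } u32x4;

/* Reads four consecutive words starting at p + 4*offset, in the spirit of vload4. */
static inline u32x4 load4(size_t offset, const uint32_t *p)
{
    u32x4 v;
    memcpy(&v, p + 4 * offset, sizeof v);
    return v;
}

/* Accessor over two 4-wide values, treating them as one 8-entry table. */
static inline uint32_t get_lut(u32x4 arr0123, u32x4 arr4567, uint32_t x)
{
    switch (x) {
    case 0: return arr0123.s0;
    case 1: return arr0123.s1;
    case 2: return arr0123.s2;
    case 3: return arr0123.s3;
    case 4: return arr4567.s0;
    case 5: return arr4567.s1;
    case 6: return arr4567.s2;
    default: return arr4567.s3; /* x == 7, or anything out of range */
    }
}

/* One work-item's loop: the table is read from memory exactly twice,
   and every later lookup uses only the two local values. */
void remap(const uint32_t *global_table, const uint32_t *X, uint32_t *Y, size_t n)
{
    const u32x4 arr0123 = load4(0, global_table);
    const u32x4 arr4567 = load4(1, global_table);

    for (size_t i = 0; i < n; i++) {
        Y[i] = get_lut(arr0123, arr4567, X[i]);
    }
}
```

The two loads are paid once per work-item, so with a large n the setup cost should be small next to the per-element lookups.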
