How to write LOP3 based instructions for Maxwell and up NVIDIA Architecture?


Question

The Maxwell architecture introduced a new PTX assembly instruction called LOP3 which, according to the NVIDIA blog:

"Can save instructions when performing complex logic operations on multiple inputs."

At GTC 2016, some CUDA developers managed to accelerate the atan2f function for the Tegra X1 processor (Maxwell) with such instructions.

However, the function below, defined within a .cu file, leaves __SET_LT and __LOP3_0xe2 undefined.

Do I have to define them in a .ptx file instead? If so, how?

float atan2f(const float dy, const float dx) 
{
 float flag, z = 0.0f;
 __SET_LT(flag, fabsf(dy), fabsf(dx));

 uint32_t m, t1 = 0x80000000; 
 float t2 = float(M_PI) / 2.0f;

 __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
 float w = flag * __int_as_float(m) + float(M_PI)/2.0f; 

 float Offset = copysignf(w, dy);
 float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));

 uint32_t r, b = __float_as_int(flag) << 2;
 uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
 __LOP3_0xe2(r, mask, t1, __float_as_int(t));

 const float p = fabsf(__int_as_float(r)) - 1.0f;
 return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}

Edit

The macro definitions ended up being:

#define __SET_LT(D, A, B) asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __SET_GT(D, A, B) asm("set.gt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __LOP3_0x2e(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
#define __LOP3_0xe2(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
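
To see what the two tables 0x2e and 0xe2 select, it may help to model lop3.b32 on the host. The sketch below is not a CUDA API: lop3_model, select_e2 and select_2e are hypothetical names, and the only assumption is the documented table convention, in which the bits of A, B and C at each position form a 3-bit index (A as the high bit) into the 8-bit table. Read that way, 0xe2 behaves like "B ? A : C" and 0x2e like "B ? ~A : C", applied bit by bit.

```cpp
#include <cstdint>

// Software model of lop3.b32 (hypothetical helper, not a CUDA API).
// For each of the 32 bit positions, the bits of a, b and c form a 3-bit
// index (a is the high bit) that selects one bit of the 8-bit table.
inline std::uint32_t lop3_model(std::uint32_t a, std::uint32_t b,
                                std::uint32_t c, std::uint8_t lut) {
  std::uint32_t d = 0;
  for (int i = 0; i < 32; ++i) {
    const unsigned idx = (((a >> i) & 1u) << 2) |
                         (((b >> i) & 1u) << 1) |
                         ((c >> i) & 1u);
    d |= static_cast<std::uint32_t>((lut >> idx) & 1u) << i;
  }
  return d;
}

// Closed forms obtained by reading the two tables entry by entry:
//   0xE2: where b is set take a,  elsewhere take c
//   0x2E: where b is set take ~a, elsewhere take c
inline std::uint32_t select_e2(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
  return (a & b) | (~b & c);
}
inline std::uint32_t select_2e(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
  return (~a & b) | (~b & c);
}
```

With B = 0x80000000, as in the function above, this reading suggests that 0xe2 copies the sign bit of A onto the remaining 31 bits of C, while 0x2e copies the inverted sign bit of A.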


Recommended answer

The lop3.b32 PTX instruction can perform a more-or-less arbitrary boolean (logical) operation on 3 variables A, B, and C.

To select the operation to be performed, we must provide a "lookup-table" immediate argument (immLut, an 8-bit quantity). As indicated in the documentation, one way to compute the necessary immLut argument for a given operation F(A,B,C) is to substitute 0xF0 for A, 0xCC for B, and 0xAA for C in the desired equation. For example, suppose we want to compute:

F = (A || B) && (!C)   ((A or B) and (not-C))

The immLut argument is then computed as:

immLut = (0xF0 | 0xCC) & (~0xAA)

Note that the equation for F is a boolean equation: it treats the arguments A, B, and C as boolean values and produces a true/false result (F). The equation that computes immLut, however, is a bitwise logical operation.

For the above example, immLut has a computed value of 0x54.
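
The substitution can also be left to the compiler. The following is a minimal sketch in plain C++ (it should also be accepted in a .cu file): the constant names and the lut_entry helper are hypothetical, and only the values 0xF0, 0xCC and 0xAA come from the documentation. It evaluates the example table at compile time and shows how a single truth-table entry can be read back out of an immLut value.

```cpp
#include <cstdint>

// Substitution constants for A, B and C (values from the PTX documentation;
// the names are made up for this sketch).
constexpr std::uint8_t kA = 0xF0;
constexpr std::uint8_t kB = 0xCC;
constexpr std::uint8_t kC = 0xAA;

// immLut for F = (A || B) && (!C), written with bitwise operators.
constexpr std::uint8_t kLutAorBandNotC =
    static_cast<std::uint8_t>((kA | kB) & ~kC);
static_assert(kLutAorBandNotC == 0x54, "expected table value for the example");

// Read one truth-table entry back out of an immLut value.
// The entry index is (a << 2) | (b << 1) | c, so A is the high selector bit.
constexpr bool lut_entry(std::uint8_t lut, bool a, bool b, bool c) {
  return (lut >> ((a ? 4 : 0) | (b ? 2 : 0) | (c ? 1 : 0))) & 1u;
}
```

For instance, lut_entry(kLutAorBandNotC, true, false, false) is true and lut_entry(kLutAorBandNotC, true, false, true) is false, which agrees with evaluating (A || B) && (!C) directly.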

To use a PTX instruction in ordinary CUDA C/C++ code, probably the most common (and arguably easiest) method is inline PTX. Inline PTX is documented, and other questions discuss how to use it, so I won't repeat that here.

Here is a worked example of the case above. Note that this particular PTX instruction is only available on cc5.0 and higher architectures, so be sure to compile for at least that target.

$ cat t1149.cu
#include <stdio.h>

const unsigned char A_or_B_and_notC=((0xF0|0xCC)&(~0xAA));

__device__ int my_LOP_0x54(int A, int B, int C){
  int temp;
  asm("lop3.b32 %0, %1, %2, %3, 0x54;" : "=r"(temp) : "r"(A), "r"(B), "r"(C));
  return temp;
}

__global__ void testkernel(){

  printf("A=true, B=false, C=true,   F=%d\n", my_LOP_0x54(true, false, true));
  printf("A=true, B=false, C=false,  F=%d\n", my_LOP_0x54(true, false, false));
  printf("A=false, B=false, C=false, F=%d\n", my_LOP_0x54(false, false, false));
}


int main(){

  printf("0x%x\n", A_or_B_and_notC);
  testkernel<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t1149 t1149.cu
$ ./t1149
0x54
A=true, B=false, C=true,   F=0
A=true, B=false, C=false,  F=1
A=false, B=false, C=false, F=0
$

Since immLut is an immediate constant in PTX code, I know of no way to pass it as a function parameter using inline PTX, even if templating is used. Based on the link you provided, it seems the authors of that presentation also used a separately defined function for each specific immediate value, presumably 0xE2 and 0x2E in their case. Also note that I chose to write my function so that it returns the result of the operation as the function return value, whereas the authors of the presentation appear to pass the result back via a function parameter. Either method should be workable. (In fact, it appears they wrote their __LOP3... codes as function-like macros rather than ordinary functions.)
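
One avenue that may be worth testing: the inline PTX documentation of more recent CUDA toolkits describes an "n" constraint for immediate integer operands. Assuming that constraint is available in your toolkit and accepts a template argument, a single templated wrapper might replace the per-value functions. The sketch below is written under that assumption and is not taken from the presentation; the names lop3 and LOP3_HD are hypothetical. On the host, or when built with a plain C++ compiler, it falls back to a sum-of-minterms evaluation of the same table so the wrapper can be checked without a GPU.

```cpp
#include <cstdint>

#ifdef __CUDACC__
#define LOP3_HD __host__ __device__
#else
#define LOP3_HD
#endif

// Hypothetical wrapper: the table is a template argument, so it is a
// compile-time constant where the asm statement is emitted.
template <int kLut>
LOP3_HD inline std::uint32_t lop3(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
  static_assert(kLut >= 0 && kLut <= 0xFF, "immLut is an 8-bit value");
#ifdef __CUDA_ARCH__
  std::uint32_t d;
  // Assumption: the "n" constraint passes kLut through as an immediate.
  asm("lop3.b32 %0, %1, %2, %3, %4;"
      : "=r"(d)
      : "r"(a), "r"(b), "r"(c), "n"(kLut));
  return d;
#else
  // Host fallback: OR together the minterms whose table bit is set.
  // Table bit i corresponds to a = bit 2 of i, b = bit 1, c = bit 0.
  std::uint32_t d = 0;
  for (int i = 0; i < 8; ++i) {
    if ((kLut >> i) & 1) {
      d |= ((i & 4) ? a : ~a) & ((i & 2) ? b : ~b) & ((i & 1) ? c : ~c);
    }
  }
  return d;
#endif
}
```

A call would then look like lop3<0x54>(A, B, C) or lop3<0xE2>(mask, t1, bits). If the toolkit in use rejects the "n" operand, the separately defined functions or macros shown earlier remain the fallback. As before, device code needs to be compiled for cc5.0 or higher.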
