128-bit integer on CUDA?

Question

I just managed to install the CUDA SDK under Ubuntu Linux 10.04. My graphics card is an NVIDIA GeForce GT 425M, and I'd like to use it for some heavy computational problems. What I wonder is: is there any way to use an unsigned 128-bit integer variable? When compiling my program with gcc to run on the CPU, I was using the __uint128_t type, but using it with CUDA doesn't seem to work. Is there anything I can do to have 128-bit integers on CUDA?

Recommended answer

For best performance, one would want to map the 128-bit type on top of a suitable CUDA vector type, such as uint4, and implement the functionality using PTX inline assembly. The addition would look something like this:

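typedef uint4 my_uint128_t;   // x = least significant 32 bits, w = most significant

// 128-bit addition as a four-limb carry chain in PTX.
__device__ my_uint128_t add_uint128 (my_uint128_t addend, my_uint128_t augend)
{
    my_uint128_t res;
    asm ("add.cc.u32      %0, %4, %8;\n\t"   // limb 0, sets the carry flag
         "addc.cc.u32     %1, %5, %9;\n\t"   // limb 1, carry in and carry out
         "addc.cc.u32     %2, %6, %10;\n\t"  // limb 2, carry in and carry out
         "addc.u32        %3, %7, %11;\n\t"  // limb 3, carry in only
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(addend.x), "r"(addend.y), "r"(addend.z), "r"(addend.w),
           "r"(augend.x), "r"(augend.y), "r"(augend.z), "r"(augend.w));
    return res;
}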
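To make the carry chain explicit, here is a portable sketch of the same addition without inline assembly. It is meant as a reference for checking the PTX version, not as a faster alternative. The struct u128_limbs and the names add_u128_portable and check_add_u128 are hypothetical and do not come from the answer. u128_limbs stands in for uint4 so that the code also builds with a plain C++ compiler.

```cpp
#include <cassert>
#include <cstdint>

// Lets the same source compile with a plain C++ compiler and with nvcc.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for uint4: x is the least significant 32-bit limb,
// w the most significant one.
struct u128_limbs { uint32_t x, y, z, w; };

// Adds two 128-bit values limb by limb. Each step widens to 64 bits so that
// bit 32 of the temporary holds the carry into the next limb, which is the
// job add.cc / addc.cc / addc do with the hardware carry flag.
__host__ __device__ inline u128_limbs add_u128_portable(u128_limbs a, u128_limbs b)
{
    u128_limbs r;
    uint64_t t = (uint64_t)a.x + b.x;
    r.x = (uint32_t)t;
    t = (uint64_t)a.y + b.y + (t >> 32);
    r.y = (uint32_t)t;
    t = (uint64_t)a.z + b.z + (t >> 32);
    r.z = (uint32_t)t;
    t = (uint64_t)a.w + b.w + (t >> 32);
    r.w = (uint32_t)t;            // the carry out of the top limb is dropped
    return r;
}

// Host-side check: (2^64 - 1) + 1 must carry across two limbs into z.
inline void check_add_u128()
{
    u128_limbs a = {0xffffffffu, 0xffffffffu, 0u, 0u};
    u128_limbs b = {1u, 0u, 0u, 0u};
    u128_limbs r = add_u128_portable(a, b);
    assert(r.x == 0u && r.y == 0u && r.z == 1u && r.w == 0u);
}
```

Running the same inputs through a kernel that calls add_uint128 and comparing the four limbs against this function is one way to validate the assembly version.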

The multiplication can similarly be constructed using PTX inline assembly by breaking the 128-bit numbers into 32-bit chunks, computing the 64-bit partial products, and adding them appropriately. Obviously this takes a bit of work. One might get reasonable performance at the C level by breaking the number into 64-bit chunks and using __umul64hi() in conjunction with regular 64-bit multiplication and some additions. This would result in the following:

__device__ my_uint128_t mul_uint128 (my_uint128_t multiplicand,
                                     my_uint128_t multiplier)
{
    my_uint128_t res;
    unsigned long long ahi, alo, bhi, blo, phi, plo;
    // Reassemble the four 32-bit limbs into two 64-bit halves per operand.
    alo = ((unsigned long long)multiplicand.y << 32) | multiplicand.x;
    ahi = ((unsigned long long)multiplicand.w << 32) | multiplicand.z;
    blo = ((unsigned long long)multiplier.y << 32) | multiplier.x;
    bhi = ((unsigned long long)multiplier.w << 32) | multiplier.z;
    // Low 64 bits of the product.
    plo = alo * blo;
    // High 64 bits: upper half of alo*blo plus the low halves of the cross
    // terms. Anything above bit 127 (including ahi*bhi) is discarded.
    phi = __umul64hi (alo, blo) + alo * bhi + ahi * blo;
    res.x = (unsigned int)(plo & 0xffffffff);
    res.y = (unsigned int)(plo >> 32);
    res.z = (unsigned int)(phi & 0xffffffff);
    res.w = (unsigned int)(phi >> 32);
    return res;
}
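The 64-bit chunk arithmetic above can be checked on the CPU against the __uint128_t type mentioned in the question. The sketch below assumes a GCC or Clang host compiler, since __uint128_t is a compiler extension. The names umul64hi_host, mul128_chunks, chunks_match_uint128 and check_mul128_chunks are hypothetical. umul64hi_host only mimics what __umul64hi() returns (the upper 64 bits of a 64x64-bit product) and is not part of CUDA.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for the CUDA intrinsic __umul64hi(): the upper 64 bits
// of the full 128-bit product of two 64-bit values.
static inline uint64_t umul64hi_host(uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

// Low 128 bits of (ahi:alo) * (bhi:blo), using the same arithmetic as the
// device function: one full low product plus the low halves of the two
// cross terms. ahi*bhi only affects bits 128 and up, so it is not computed.
static inline void mul128_chunks(uint64_t ahi, uint64_t alo,
                                 uint64_t bhi, uint64_t blo,
                                 uint64_t *phi, uint64_t *plo)
{
    *plo = alo * blo;
    *phi = umul64hi_host(alo, blo) + alo * bhi + ahi * blo;
}

// Compares the chunked result with a native 128-bit multiplication.
static inline bool chunks_match_uint128(uint64_t ahi, uint64_t alo,
                                        uint64_t bhi, uint64_t blo)
{
    __uint128_t a = ((__uint128_t)ahi << 64) | alo;
    __uint128_t b = ((__uint128_t)bhi << 64) | blo;
    __uint128_t ref = a * b;                    // wraps modulo 2^128
    uint64_t phi, plo;
    mul128_chunks(ahi, alo, bhi, blo, &phi, &plo);
    return plo == (uint64_t)ref && phi == (uint64_t)(ref >> 64);
}

static inline void check_mul128_chunks()
{
    assert(chunks_match_uint128(0, ~0ULL, 0, ~0ULL));
    assert(chunks_match_uint128(0x0123456789abcdefULL, 0xfedcba9876543210ULL,
                                0xdeadbeefcafebabeULL, 0xffffffffffffffffULL));
}
```

As a worked case, (2^64 - 1)^2 = 2^128 - 2^65 + 1, so the low half is 1 and the high half is 2^64 - 2.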

Below is a version of the 128-bit multiplication that uses PTX inline assembly. It requires PTX 3.0, which shipped with CUDA 4.2, and the code requires a GPU with at least compute capability 2.0, i.e. a Fermi or Kepler class device. The code uses the minimal number of instructions, as sixteen 32-bit multiplies are needed to implement a 128-bit multiplication. By comparison, the variant above using CUDA intrinsics compiles to 23 instructions for an sm_20 target.

__device__ my_uint128_t mul_uint128 (my_uint128_t a, my_uint128_t b)
{
    my_uint128_t res;
    asm ("{\n\t"
         "mul.lo.u32      %0, %4, %8;    \n\t"
         "mul.hi.u32      %1, %4, %8;    \n\t"
         "mad.lo.cc.u32   %1, %4, %9, %1;\n\t"
         "madc.hi.u32     %2, %4, %9,  0;\n\t"
         "mad.lo.cc.u32   %1, %5, %8, %1;\n\t"
         "madc.hi.cc.u32  %2, %5, %8, %2;\n\t"
         "madc.hi.u32     %3, %4,%10,  0;\n\t"
         "mad.lo.cc.u32   %2, %4,%10, %2;\n\t"
         "madc.hi.u32     %3, %5, %9, %3;\n\t"
         "mad.lo.cc.u32   %2, %5, %9, %2;\n\t"
         "madc.hi.u32     %3, %6, %8, %3;\n\t"
         "mad.lo.cc.u32   %2, %6, %8, %2;\n\t"
         "madc.lo.u32     %3, %4,%11, %3;\n\t"
         "mad.lo.u32      %3, %5,%10, %3;\n\t"
         "mad.lo.u32      %3, %6, %9, %3;\n\t"
         "mad.lo.u32      %3, %7, %8, %3;\n\t"
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return res;
}
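The count of sixteen follows from the limb structure: with 32-bit limbs a[i] and b[j], only the ten partial products with i + j <= 3 reach the low 128 bits, and for the four with i + j = 3 only the low half matters, which leaves 10 low-half and 6 high-half multiplies. The following host-side sketch computes the same truncated schoolbook product in portable code, so the PTX output can be compared against it. The names mul128_limbs and check_mul128_limbs are hypothetical and not from the answer.

```cpp
#include <cassert>
#include <cstdint>

// Low 128 bits of a * b, with operands and result stored as four 32-bit
// limbs (index 0 = least significant). Only partial products with
// i + j <= 3 are formed, since the others lie entirely above bit 127.
static inline void mul128_limbs(const uint32_t a[4], const uint32_t b[4],
                                uint32_t r[4])
{
    uint32_t acc[4] = {0u, 0u, 0u, 0u};
    for (int i = 0; i < 4; ++i) {
        uint64_t carry = 0;
        for (int j = 0; i + j < 4; ++j) {
            // 32x32 -> 64-bit product plus the current column and carry.
            // This cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1.
            uint64_t t = (uint64_t)a[i] * b[j] + acc[i + j] + carry;
            acc[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        // The carry out of limb 3 is beyond the 128-bit result and dropped.
    }
    for (int k = 0; k < 4; ++k)
        r[k] = acc[k];
}

// (2^64 - 1)^2 = 2^128 - 2^65 + 1, i.e. limbs {1, 0, 0xfffffffe, 0xffffffff}.
static inline void check_mul128_limbs()
{
    const uint32_t a[4] = {0xffffffffu, 0xffffffffu, 0u, 0u};
    uint32_t r[4];
    mul128_limbs(a, a, r);
    assert(r[0] == 1u && r[1] == 0u);
    assert(r[2] == 0xfffffffeu && r[3] == 0xffffffffu);
}
```

In this layout limb 0 corresponds to the .x member of my_uint128_t and limb 3 to .w.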
