Emulating FP64 with 2 FP32 on a GPU

Question

If one were to emulate double-precision floating point with two single-precision floats, what would the performance be like, and can it be done well?

Currently Nvidia is charging quite a premium for double-precision-enabled Tesla cards, which let you get one third of the single-precision performance (notable exceptions: Titan/Titan Black).

If one were to use a GeForce GPU with gimped double precision and emulate double precision using 2 single-precision floats, what would the performance be like?

Answer

You can get a rough estimate of the performance by counting the number of float operations required to implement each double-float operation. You would want to inspect binary code with cuobjdump --dump-sass to get an accurate count. I am showing a double-float multiplication below that takes full advantage of FMA (fused multiply-add) support on the GPU. For double-float addition code, I would point you to a paper by Andrew Thall as I do not have the time to code this up right now. From previous analysis I believe the addition code given in the paper is correct, and that it avoids common pitfalls in faster but less accurate implementations (which lose accuracy when the magnitude of the operands is within a factor of two).
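For orientation, here is a minimal sketch of the standard TwoSum-based double-float addition scheme (the same general approach used in Thall's paper and in the QD library); it is an illustrative sketch of that technique, not the exact code from the paper. It assumes the dblfloat typedef (.y = head, .x = tail) defined with the multiplication code below, and, like that code, it requires strict IEEE float arithmetic (no fast-math re-association).

/* Illustrative sketch of TwoSum-based double-float addition; not the exact
   code from Thall's paper. Uses the dblfloat typedef (.y = head, .x = tail)
   defined below. Roughly 20 float operations. */
__host__ __device__ __forceinline__
dblfloat add_dblfloat (dblfloat x, dblfloat y)
{
    dblfloat z;
    float s, e, t, f, w;
    /* TwoSum of the heads: s + e == x.y + y.y exactly */
    s = x.y + y.y;
    w = s - x.y;
    e = (x.y - (s - w)) + (y.y - w);
    /* TwoSum of the tails: t + f == x.x + y.x exactly */
    t = x.x + y.x;
    w = t - x.x;
    f = (x.x - (t - w)) + (y.x - w);
    /* fold the error terms back in and renormalize twice (fast two-sum) */
    e = e + t;
    t = s + e;
    e = e - (t - s);
    e = e + f;
    z.y = t + e;
    z.x = e - (z.y - t);
    return z;
}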

If you are a registered CUDA developer you can download double-double code from NVIDIA's developer website (log in at https://developer.nvidia.com) which is under BSD license, and rework it relatively quickly into double-float code. NVIDIA's double-double code supports the operations addition, subtraction, division, square root, and reciprocal square root.

As you can see, the multiplication below requires 8 float instructions; unary negation is absorbed into FMA. The addition requires around 20 float instructions. However, the instruction sequences for double-float operations also require temporary variables, which increases register pressure and can decrease occupancy. A reasonably conservative estimate may therefore be that double-float arithmetic performs at 1/20 the throughput of native float arithmetic. You can easily measure this yourself, in the context relevant to you, i.e. your use case(s).

typedef float2 dblfloat;  // .y = head, .x = tail

__host__ __device__ __forceinline__ 
dblfloat mul_dblfloat (dblfloat x, dblfloat y)
{
    dblfloat t, z;
    float sum;
    t.y = x.y * y.y;
    t.x = fmaf (x.y, y.y, -t.y);
    t.x = fmaf (x.x, y.x, t.x);
    t.x = fmaf (x.y, y.x, t.x);
    t.x = fmaf (x.x, y.y, t.x);
    /* normalize result */
    sum = t.y + t.x;
    z.x = (t.y - sum) + t.x;
    z.y = sum;
    return z;
}
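For experimentation, here is a hedged sketch of conversion helpers between double and this double-float layout; the helper names are mine and are not part of the original answer.

/* Illustrative helpers (names are mine, not from the original answer):
   split a double into head/tail floats and recombine for checking. */
__host__ __device__ __forceinline__
dblfloat dblfloat_from_double (double a)
{
    dblfloat z;
    z.y = (float)a;                  /* head: a rounded to nearest float */
    z.x = (float)(a - (double)z.y);  /* tail: leftover low-order bits */
    return z;
}

__host__ __device__ __forceinline__
double double_from_dblfloat (dblfloat a)
{
    return (double)a.y + (double)a.x;
}

With these, a quick sanity check is to compare double_from_dblfloat (mul_dblfloat (dblfloat_from_double (a), dblfloat_from_double (b))) against a * b computed in double.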

Note that in various applications, full double-float arithmetic may not be necessary. Instead one can use float computation, augmented by error compensating techniques, one of the oldest of which is the Kahan summation. I gave a brief overview of easily available literature on such methods in a recent posting in the NVIDIA developer forums. In the comments above, Robert Crovella also pointed to a GTC 2015 talk by Scott LeGrand, which I haven't had time to check out yet.
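For illustration, here is a minimal sketch of textbook Kahan (compensated) summation in plain float; it is not code taken from the forum posting referenced above, and it must be compiled without fast-math style re-association, or the compensation term gets optimized away.

/* Textbook Kahan (compensated) summation over an array of floats.
   Requires strict IEEE evaluation order (no -use_fast_math / re-association). */
__host__ __device__
float kahan_sum (const float *a, int n)
{
    float sum = 0.0f;
    float c = 0.0f;              /* running compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        float y = a[i] - c;      /* apply the correction to the next term */
        float t = sum + y;       /* low-order bits of y are lost here ... */
        c = (t - sum) - y;       /* ... and recovered into c */
        sum = t;
    }
    return sum;
}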

As for accuracy, double-float has a representational precision of 49 (24+24+1) bits, compared with IEEE-754 double which provides 53 bits. However, double-float cannot maintain this precision for operands small in magnitude, as the tail portion can become a denormal or zero. When denormal support is turned on, the 49 bits of precision are guaranteed for 2^-101 <= |x| < 2^128. Denormal support for float is turned on by default in the CUDA tool chain for architectures >= sm_20, which means all architectures supported by the currently shipping version, CUDA 7.0.

As opposed to operations on IEEE-754 double data, double-float operations are not correctly rounded. For the double-float multiplication above, using 2 billion random test cases (with all source operands and results within the bounds stated above), I observed an upper bound of 1.42e-14 for the relative error. I do not have data for the double-float addition, but its error bound should be similar.
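As a rough illustration of how such a measurement could be reproduced on the host (this setup is an assumption, not the author's original harness), one could compare mul_dblfloat against a double reference for random operands, using the conversion helpers sketched above; the file would be compiled as .cu with nvcc alongside those routines.

/* Assumed host-side check, not the author's original harness: compare
   mul_dblfloat against double multiplication for random operands. */
#include <cstdio>
#include <cstdlib>
#include <cmath>

int main (void)
{
    double max_rel_err = 0.0;
    srand (12345);
    for (int i = 0; i < 10000000; i++) {
        double a = (double)rand () / RAND_MAX * 2.0 - 1.0;
        double b = (double)rand () / RAND_MAX * 2.0 - 1.0;
        dblfloat x = dblfloat_from_double (a);
        dblfloat y = dblfloat_from_double (b);
        /* reference: the exact values held in x and y, multiplied in double */
        double ref = double_from_dblfloat (x) * double_from_dblfloat (y);
        double res = double_from_dblfloat (mul_dblfloat (x, y));
        if (ref != 0.0) {
            double rel_err = fabs (res - ref) / fabs (ref);
            if (rel_err > max_rel_err) max_rel_err = rel_err;
        }
    }
    printf ("max. relative error observed: %g\n", max_rel_err);
    return 0;
}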
