Emulating FP64 with 2 FP32 on a GPU


Problem description

If one were to emulate double-precision floating point with two single-precision floats, what would the performance be like, and can it be done well?

Currently Nvidia charges quite a premium for double-precision-enabled Tesla cards, which give you one third of the single-precision performance (notable exceptions: Titan/Titan Black).

If one were to use a GeForce GPU with gimped double precision and emulate double precision using two single-precision floats, what would the performance be like?

Answer

You can get a rough estimate of the performance by counting the number of float operations required to implement each double-float operation. You would want to inspect the binary code with cuobjdump --dump-sass to get an accurate count. I am showing a double-float multiplication below that takes full advantage of FMA (fused multiply-add) support on the GPU. For double-float addition code, I would point you to a paper by Andrew Thall, as I do not have the time to code this up right now. From previous analysis I believe the addition code given in the paper is correct, and that it avoids common pitfalls found in faster but less accurate implementations (which lose accuracy when the magnitudes of the operands are within a factor of two of each other).
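For the SASS inspection step, one possible workflow (an illustrative addition, not part of the original answer) is to wrap the routine in a trivial kernel, compile the file to a cubin, and disassemble it with cuobjdump --dump-sass. The kernel below is such a wrapper; its name is assumed, and it relies on the dblfloat typedef and the mul_dblfloat routine shown further down.

/* Illustrative wrapper kernel (assumed name) around mul_dblfloat, defined
   further below. Compiling this translation unit to a cubin and running
   cuobjdump --dump-sass on it lets you count the float instructions the
   compiler actually emits for the double-float multiplication. */
__global__ void mul_dblfloat_kernel (const dblfloat *a, const dblfloat *b,
                                     dblfloat *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = mul_dblfloat (a[i], b[i]);
    }
}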

If you are a registered CUDA developer you can download double-double code from NVIDIA's developer website (log in at https://developer.nvidia.com), which is under a BSD license, and rework it relatively quickly into double-float code. NVIDIA's double-double code supports the operations addition, subtraction, division, square root, and reciprocal square root.

As you can see, the multiplication below requires 8 float instructions; unary negation is absorbed into the FMA. The addition requires around 20 float instructions. However, the instruction sequences for double-float operations also require temporary variables, which increases register pressure and can decrease occupancy. A reasonably conservative estimate may therefore be that double-float arithmetic delivers about 1/20 the throughput of native float arithmetic. You can easily measure this yourself in the context relevant to you, i.e. your use case(s).

typedef float2 dblfloat;  // .y = head, .x = tail

__host__ __device__ __forceinline__ 
dblfloat mul_dblfloat (dblfloat x, dblfloat y)
{
    dblfloat t, z;
    float sum;
    t.y = x.y * y.y;                  /* head x head product */
    t.x = fmaf (x.y, y.y, -t.y);      /* exact rounding error of the head product */
    t.x = fmaf (x.x, y.x, t.x);       /* add tail x tail */
    t.x = fmaf (x.y, y.x, t.x);       /* add head x tail */
    t.x = fmaf (x.x, y.y, t.x);       /* add tail x head */
    /* normalize result (fast two-sum) */
    sum = t.y + t.x;
    z.x = (t.y - sum) + t.x;
    z.y = sum;
    return z;
}
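Although the answer above defers to Thall's paper for the addition code, the following is a hedged sketch of double-float addition built from the classic two-sum and fast two-sum building blocks, in the same head/tail convention as above. It is my own illustrative reconstruction of the roughly 20-instruction approach, not the verbatim code from the paper, so verify its error behavior against a double reference before relying on it.

/* Illustrative sketch only (not verbatim from Thall's paper): double-float
   addition via two-sum of the heads and tails followed by two fast two-sum
   renormalization steps. Uses the dblfloat typedef above (.y = head, .x = tail). */
__host__ __device__ __forceinline__
dblfloat add_dblfloat (dblfloat x, dblfloat y)
{
    float s, e, t, f, w;
    dblfloat z;
    /* two-sum of the heads: s + e equals x.y + y.y exactly */
    s = x.y + y.y;
    w = s - x.y;
    e = (x.y - (s - w)) + (y.y - w);
    /* two-sum of the tails: t + f equals x.x + y.x exactly */
    t = x.x + y.x;
    w = t - x.x;
    f = (x.x - (t - w)) + (y.x - w);
    /* fold the tail sum into the error term, then renormalize twice */
    e = e + t;
    w = s + e;              /* fast two-sum, assuming |s| >= |e| */
    e = e - (w - s);
    s = w;
    e = e + f;
    w = s + e;              /* second fast two-sum */
    z.x = e - (w - s);
    z.y = w;
    return z;
}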

Note that in various applications, full double-float arithmetic may not be necessary. Instead, one can use float computation augmented by error-compensation techniques, one of the oldest of which is Kahan summation. I gave a brief overview of the easily available literature on such methods in a recent posting in the NVIDIA developer forums. In the comments above, Robert Crovella also pointed to a GTC 2015 talk by Scott LeGrand, which I haven't had time to check out yet.
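As a concrete illustration of the compensation idea, here is a minimal Kahan summation sketch in plain float. The function name and interface are my own illustrative choices; also note that aggressive fast-math compilation can reassociate the operations and defeat the compensation.

/* Minimal sketch of Kahan (compensated) summation in float; illustrative
   name and interface. A running compensation term recovers the low-order
   bits lost in each addition. */
__host__ __device__
float kahan_sum (const float *a, int n)
{
    float sum = 0.0f;   /* running sum */
    float c   = 0.0f;   /* running compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        float y = a[i] - c;     /* apply the compensation to the next term */
        float t = sum + y;      /* low-order bits of y may be lost here */
        c = (t - sum) - y;      /* recover what was lost */
        sum = t;
    }
    return sum;
}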

As for accuracy, double-float has a representational precision of 49 (24+24+1) bits, compared with IEEE-754 double, which provides 53 bits. However, double-float cannot maintain this precision for operands small in magnitude, as the tail portion can become a denormal or zero. When denormal support is turned on, the 49 bits of precision are guaranteed for 2^-101 <= |x| < 2^128. Denormal support for float is turned on by default in the CUDA tool chain for architectures >= sm_20, which means all architectures supported by the currently shipping version, CUDA 7.0.

As opposed to operations on IEEE-754 double data, double-float operations are not correctly rounded. For the double-float multiplication above, using 2 billion random test cases (with all source operands and results within the bounds stated above), I observed an upper bound of 1.42e-14 for the relative error. I do not have data for the double-float addition, but its error bound should be similar.
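If you want to reproduce such a measurement, a minimal host-side harness along the following lines can be used. The conversion helpers, the operand distribution, and the sample count are illustrative assumptions (not the original test setup), and the bound you observe will depend on how you draw the operands; compile the file with nvcc so that the float2-based dblfloat type and mul_dblfloat above are available.

/* Illustrative host-side error check for mul_dblfloat (defined above);
   compares the double-float product against a double-precision reference. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double dblfloat_to_double (dblfloat a)
{
    return (double)a.y + (double)a.x;          /* head + tail */
}

static dblfloat double_to_dblfloat (double a)
{
    dblfloat r;
    r.y = (float)a;                            /* head: nearest float */
    r.x = (float)(a - (double)r.y);            /* tail: residual error */
    return r;
}

int main (void)
{
    double max_rel_err = 0.0;
    for (int i = 0; i < 10000000; i++) {
        /* operands of moderate magnitude; adjust the range as needed */
        double a = 0.5 + (double)rand() / RAND_MAX;
        double b = 0.5 + (double)rand() / RAND_MAX;
        dblfloat da = double_to_dblfloat (a);
        dblfloat db = double_to_dblfloat (b);
        /* reference: double product of the values actually stored */
        double ref = dblfloat_to_double (da) * dblfloat_to_double (db);
        double res = dblfloat_to_double (mul_dblfloat (da, db));
        double rel_err = fabs (res - ref) / fabs (ref);
        if (rel_err > max_rel_err) max_rel_err = rel_err;
    }
    printf ("max. relative error observed: %g\n", max_rel_err);
    return 0;
}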

