Efficient computation of 2**64 / divisor via fast floating-point reciprocal

Problem description

I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 2**64 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:

#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()

uint64_t divisor, recip;
float r, s, t;

t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor 

While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:

#include <math.h>       // import fmaf()
#include <xmmintrin.h>  // import _mm_rcp_ss()
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}

Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly reinterpret an IEEE 754 binary32 as an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine the use of nextafterf() and scaling as follows:

s = int_as_float (float_as_int (r) + 0x1fffffff);
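
Where such intrinsics are not available, a minimal portable sketch of the two helpers is shown below (the names float_as_int() and int_as_float() simply follow the convention used above; memcpy-based type punning typically compiles down to a single register move):

#include <stdint.h>
#include <string.h>

static inline int32_t float_as_int (float a)
{
    int32_t r;
    memcpy (&r, &a, sizeof r);  // reinterpret the binary32 bit pattern as int32
    return r;
}

static inline float int_as_float (int32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);  // reinterpret the int32 bit pattern as binary32
    return r;
}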

Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.

t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}

I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and following those up with a 32-bit fixed-point Newton-Raphson step.
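
For reference, one such integer-side rounding approach might look like the sketch below: round the integer up to a multiple of 2**shift so that at most 24 significant bits remain, convert that exactly, and scale back by an exact power of two. This is only a sketch (hypothetical helper name, GCC/Clang-specific __builtin_clzll()), illustrating the kind of bit manipulation involved rather than claiming it is fast:

#include <stdint.h>
#include <math.h>   // import ldexpf()

/* sketch: uint64_t-to-float conversion rounding toward +infinity, done with
   integer-side rounding followed by an exact conversion and exact scaling */
static inline float uint64_to_float_ru_int (uint64_t a)
{
    if (a == 0ULL) return 0.0f;
    int bits  = 64 - __builtin_clzll (a);       // number of significant bits in a
    int shift = bits - 24;                      // excess beyond float's 24-bit significand
    if (shift <= 0) return (float)a;            // a is exactly representable as float
    uint64_t r = ((a - 1ULL) >> shift) + 1ULL;  // ceil (a / 2**shift), overflow-safe
    return ldexpf ((float)r, shift);            // r <= 2**24, so both steps are exact
}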

Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.

uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip); 
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}

umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:

inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"  // rax = a
        "mulq  %2;\n\t"         // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"  // res = (a * b)<63:32>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}
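
Where the compiler supports a 128-bit integer type (as GCC and Clang do via __uint128_t on 64-bit targets), a portable sketch that typically compiles to the same single mulq instruction (the name umul64hi_portable is ours, to avoid claiming it is the implementation used above):

#include <stdint.h>

/* sketch: upper 64 bits of the full 128-bit product, assuming __uint128_t support */
static inline uint64_t umul64hi_portable (uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}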

Recommended answer

This solution combines two ideas:

  • You can convert to floating point by simply reinterpreting the bits as a floating-point number and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, and then subtract that constant. This gives a truncated result (and is therefore always less than or equal to the desired value).
  • You can approximate a reciprocal by negating both the exponent and the mantissa. This may be achieved by interpreting the bits as int.

Option 1 here only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.
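
As a minimal standalone illustration of idea 1 (hypothetical helper name; assumes x < 2**52 so no pre-shift is needed): adding x to the bit pattern of 2**52 places x in the mantissa field, and subtracting 2**52 afterwards recovers x exactly.

#include <stdint.h>
#include <string.h>

static double u64_to_double_magic (uint64_t x)   /* sketch; valid for x < 2**52 */
{
    const double magic = 0x1.0p52;        // 2**52: mantissa field is all zeros
    uint64_t bits;
    double d;
    memcpy (&bits, &magic, sizeof bits);  // bit pattern of 2**52
    bits += x;                            // x lands in the 52-bit mantissa field
    memcpy (&d, &bits, sizeof d);         // reinterpret: value is now 2**52 + x
    return d - magic;                     // exact subtraction recovers x
}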

The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.

After this you'd want to do the Newton-Raphson iteration(s).

Much of this code simply converts to magic numbers.

double
u64tod_inv( uint64_t u64 ) {
  __asm__( "#annot0" );
  union {
    double f;
    struct {
      unsigned long m:52; // careful here with endianness; assumes 64-bit unsigned long
      unsigned long x:11;
      unsigned long s:1;
    } u64;
    uint64_t u64i;
  } z,
        magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },
        magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },
        magic2 = { .u64 = { 0, 2046, 0 } };

  __asm__( "#annot1" );
  if( u64 < (1UL << 52UL ) ) {
    z.u64i = u64 + magic0.u64i;
    z.f   -= magic0.f;
  } else {
    z.u64i = ( u64 >> 12 ) + magic1.u64i;
    z.f   -= magic1.f;
  }
  __asm__( "#annot2" );

  z.u64i = magic2.u64i - z.u64i;

  return z.f;
}

Compiling this on an Intel Core i7 gives a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast this should run pretty quickly.

I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...
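
For completeness, one such refinement step could look like the classic Newton-Raphson reciprocal iteration, which roughly doubles the number of correct bits per step (a floating-point sketch only; the question's own code instead refines in 64-bit fixed point):

#include <math.h>   // import fma()

/* sketch: one Newton-Raphson step refining r ~= 1/x */
static inline double refine_recip (double r, double x)
{
    double e = fma (-x, r, 1.0);  // residual 1 - x*r, computed with a fused multiply-add
    return fma (e, r, r);         // r' = r + e*r = r*(2 - x*r)
}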
