通过快速浮点倒数高效地计算2 ** 64/除数 [英] Efficient computation of 2**64 / divisor via fast floating-point reciprocal

查看:72
本文介绍了通过快速浮点倒数高效地计算2 ** 64/除数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在研究使用各种现代处理器的快速单精度浮点倒数功能来基于定点Newton-Raphson迭代计算64位无符号整数除法的起始近似值的方法.它要求根据以下定点迭代的要求,尽可能精确地计算2 64 /除数,其中初始近似值必须小于或等于数学结果.这意味着该计算需要提供一个低估的数据.我目前有以下代码,基于广泛的测试,这些代码效果很好:

 #include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()

uint64_t divisor, recip;
float r, s, t;

t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor 
 

尽管此代码可以正常运行,但在大多数平台上并不能很快实现.一个明显的改进(需要一些机器特定的代码)是将除法r = 1.0f / t替换为利用硬件提供的快速浮点倒数的代码.可以通过迭代来增加该值,以产生在数学结果的1 ulp以内的结果,因此在现有代码的上下文中会产生低估. x86_64的示例实现为:

 #include <xmmintrin.h>
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}
 

nextafterf()的实现通常未对性能进行优化.在可以通过内在函数float_as_int()int_as_float()快速将IEEE 754 binary32解释为int32以及反之的平台上,我们可以结合使用nextafterf()和按比例缩放,如下所示:/p>

s = int_as_float (float_as_int (r) + 0x1fffffff);

假设在给定平台上可以使用这些方法,则这使floatuint64_t之间的转换成为主要障碍.大多数平台不提供使用静态舍入模式执行从uint64_tfloat的转换的指令(此处:向正无穷大=向上),并且某些平台不提供任何在uint64_t和浮动之间进行转换的指令.点类型,这成为性能瓶颈.

 t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
 

uint64_to_float_ru的一种可移植但缓慢的实现方式是对FPU舍入模式进行动态更改:

 #include <fenv.h>
#pragma STDC FENV_ACCESS ON

float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}
 

我研究了各种拆分和位扭曲方法来处理转换(例如,在整数端进行舍入,然后使用正常转换为float,该转换使用IEEE 754舍入模式从最接近值舍入-或-甚至),但由此产生的开销使从性能角度来看,通过快速浮点数倒数不吸引人的这一计算成为可能.就目前而言,我似乎最好使用带插值的经典LUT或定点多项式逼近来生成起始逼近,然后再执行32位定点Newton-Raphson步骤./p>

有没有办法提高我当前方法的效率?涉及特定平台的内在函数的便携式和半便携式方法会引起人们的关注(特别是对于x86和ARM作为当前占主导地位的CPU架构) ).使用Intel编译器以极高的优化(/O3 /QxCORE-AVX2 /Qprec-div-)为x86_64进行编译,初始近似值的计算要比迭代(大约需要20条指令)花费更多的指令.下面是完整的除法代码以供参考,显示了上下文中的近似值.

 uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip); 
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}
 

umul64hi()通常将映射到特定于平台的内在代码或一些内联汇编代码.在x86_64上,我目前使用此实现:

 inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"  // rax = a
        "mulq  %2;\n\t"         // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"  // res = (a * b)<63:32>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}
 

解决方案

此解决方案结合了两个想法:

  • 只要将数字重新解释为浮点并减去一个常数(只要数字在特定范围内),就可以转换为浮点.因此,添加一个常数,重新解释,然后减去该常数.这将给出一个截断的结果(因此,它总是小于或等于所需的值).
  • 您可以通过同时取指数和尾数来近似倒数.这可以通过将这些位解释为int来实现.

此处的选项1仅在一定范围内有效,因此我们检查范围并调整使用的常数.之所以可以使用64位,是因为所需的浮点数只有23位的精度.

此代码中的结果将是两倍,但是转换为浮点数是微不足道的,并且可以在位上或直接完成,具体取决于硬件.

此后,您需要进行Newton-Raphson迭代.

大部分代码都可以简单地转换为幻数.

double                                                       
u64tod_inv( uint64_t u64 ) {                                 
  __asm__( "#annot0" );                                      
  union {                                                    
    double f;                                                
    struct {                                                 
      unsigned long m:52; // careful here with endianess     
      unsigned long x:11;                                    
      unsigned long s:1;                                     
    } u64;                                                   
    uint64_t u64i;                                           
  } z,                                                       
        magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },        
        magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },   
        magic2 = { .u64 = { 0, 2046, 0 } };                  

  __asm__( "#annot1" );                                      
  if( u64 < (1UL << 52UL ) ) {                               
    z.u64i = u64 + magic0.u64i;                              
    z.f   -= magic0.f;                                       
  } else {                                                   
    z.u64i = ( u64 >> 12 ) + magic1.u64i;                    
    z.f   -= magic1.f;                                       
  }                                                          
  __asm__( "#annot2" );                                      

  z.u64i = magic2.u64i - z.u64i;                             

  return z.f;                                                
}                                                            

在Intel核心7上编译可提供许多指令(和分支),但是,当然根本没有乘除运算.如果int和double之间的转换速度很快,则应该运行得很快.

我怀疑浮点数(只有23位的精度)将需要超过1或2次Newton-Raphson迭代才能获得所需的精度,但是我尚未进行数学运算...

I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 264 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:

#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()

uint64_t divisor, recip;
float r, s, t;

t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor 

While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:

#include <xmmintrin.h>
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
    __m128 t;
    float e, r;
    t = _mm_set_ss (a);
    t = _mm_rcp_ss (t);
    _mm_store_ss (&r, t);
    e = fmaf (r, -a, 1.0f);
    e = fmaf (e, e, e);
    r = fmaf (e, r, r);
    return r;
}

Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly re-interprete an IEEE 754 binary32 into an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine use of nextafterf() and scaling as follows:

s = int_as_float (float_as_int (r) + 0x1fffffff);

Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.

t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

float uint64_to_float_ru (uint64_t a)
{
    float res;
    int curr_mode = fegetround ();
    fesetround (FE_UPWARD);
    res = (float)a;
    fesetround (curr_mode);
    return res;
}

I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and follow those up with a 32-bit fixed-point Newton-Raphson step.

Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.

uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
    uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
    float r, s, t;

    /* compute initial approximation for reciprocal; must be underestimate! */
    t = uint64_to_float_ru (divisor);
    r = 1.0f / t;
    s = 0x1.0p64f * nextafterf (r, 0.0f);
    recip = (uint64_t)s; /* underestimate of 2**64 / divisor */

    /* perform Halley iteration with cubic convergence to refine reciprocal */
    temp = neg_divisor * recip;
    temp = umul64hi (temp, temp) + temp;
    recip = umul64hi (recip, temp) + recip;

    /* compute preliminary quotient and remainder */
    quot = umul64hi (dividend, recip); 
    rem = dividend - divisor * quot;

    /* adjust quotient if too small; quotient off by 2 at most */
    if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;

    /* handle division by zero */
    if (divisor == 0ULL) quot = ~0ULL;

    return quot;
}

umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:

inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
    uint64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"  // rax = a
        "mulq  %2;\n\t"         // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"  // res = (a * b)<63:32>
        : "=rm" (res)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    return res;
}

解决方案

This solution combines two ideas:

  • You can convert to floating point by simply reinterpreting the bits as floating point and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, and then subtract that constant. This will give a truncated result (which is therefore always less than or equal the desired value).
  • You can approximate reciprocal by negating both the exponent and the mantissa. This may be achieved by interpreting the bits as int.

Option 1 here only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.

The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.

After this you'd want to do the Newton-Raphson iteration(s).

Much of this code simply converts to magic numbers.

double                                                       
u64tod_inv( uint64_t u64 ) {                                 
  __asm__( "#annot0" );                                      
  union {                                                    
    double f;                                                
    struct {                                                 
      unsigned long m:52; // careful here with endianess     
      unsigned long x:11;                                    
      unsigned long s:1;                                     
    } u64;                                                   
    uint64_t u64i;                                           
  } z,                                                       
        magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },        
        magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },   
        magic2 = { .u64 = { 0, 2046, 0 } };                  

  __asm__( "#annot1" );                                      
  if( u64 < (1UL << 52UL ) ) {                               
    z.u64i = u64 + magic0.u64i;                              
    z.f   -= magic0.f;                                       
  } else {                                                   
    z.u64i = ( u64 >> 12 ) + magic1.u64i;                    
    z.f   -= magic1.f;                                       
  }                                                          
  __asm__( "#annot2" );                                      

  z.u64i = magic2.u64i - z.u64i;                             

  return z.f;                                                
}                                                            

Compiling this on an Intel core 7 gives a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast this should run pretty quickly.

I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...

这篇关于通过快速浮点倒数高效地计算2 ** 64/除数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆