Fastest precise way to convert a vector of integers into floats between 0 and 1


Question

Consider a randomly generated __m256i vector. Is there a faster precise way to convert its elements into a __m256 vector of floats between 0 (inclusive) and 1 (exclusive) than division by float(1ull<<32)?

Here is what I have tried so far, where iRand is the input and ans is the output:

const __m256 fRand = _mm256_cvtepi32_ps(iRand);
const __m256 normalized = _mm256_div_ps(fRand, _mm256_set1_ps(float(1ull<<32)));
const __m256 ans = _mm256_add_ps(normalized, _mm256_set1_ps(0.5f));

Recommended answer

The implementation below should be faster than the version that uses _mm256_div_ps.

vdivps is quite slow: on my Haswell Xeon it has 18-21 cycles of latency and a throughput of one instruction per 14 cycles. Newer CPUs perform better: 11/5 on Skylake and 10/6 on Ryzen.

As noted in the comments, the performance problem can be fixed by replacing the division with a multiplication, and improved further with FMA. The problem with that approach is the quality of the distribution: if you try to force the results into the output interval by changing the rounding mode or by clipping, you introduce peaks in the probability distribution of the output numbers.
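To illustrate why the multiply-based conversion ends up needing rounding tricks or clipping, here is a scalar sketch of a single lane of the question's computation, with the division replaced by a multiplication by 2^-32. The function name naiveNormalize is hypothetical, and the scalar form is an assumption made for readability; a vector version would presumably use _mm256_mul_ps or _mm256_fmadd_ps.

```cpp
#include <cstdint>

// Scalar model of one lane of the question's code, with the division
// replaced by a multiplication. Hypothetical helper, not part of the answer.
float naiveNormalize( std::int32_t i )
{
    // 2^-32 is a power of two, so multiplying by it gives the same result
    // as dividing by 2^32 for every value an int32 can produce.
    constexpr float scale = 1.0f / 4294967296.0f;
    // The int -> float conversion rounds to 24 significant bits.
    return static_cast<float>( i ) * scale + 0.5f;
}
```

For i = INT32_MAX the conversion rounds 2147483647 up to 2^31, so the result is exactly 1.0f, which lies outside [0, 1). Clamping such values back into the interval is one way the peaks in the distribution arise.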

My implementation is not ideal either: it does not produce every possible value in the output interval and skips many representable floats, especially near 0. But at least the distribution is very even.

__m256 __vectorcall randomFloats( __m256i randomBits )
{
    // Convert to random float bits
    __m256 result = _mm256_castsi256_ps( randomBits );

    // Zero out exponent bits, leave random bits in mantissa.
    // BTW since the mask value is constexpr, we don't actually need AVX2 instructions for this, it's just easier to code with set1_epi32.
    const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
    result = _mm256_and_ps( result, mantissaMask );

    // Set sign + exponent bits to that of 1.0, which is sign=0, exponent=2^0.
    const __m256 one = _mm256_set1_ps( 1.0f );
    result = _mm256_or_ps( result, one );

    // Subtract 1.0. The above algorithm generates floats in range [1..2).
    // Can't use bit tricks to generate floats in [0..1) because it would cause them to be distributed very unevenly.
    return _mm256_sub_ps( result, one );
}
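As a rough illustration of what each lane of the function above computes, here is a scalar sketch of the same bit trick. The name randomFloatScalar is hypothetical, and std::memcpy stands in for the _mm256_castsi256_ps reinterpretation; this is a model for reasoning about the output, not a replacement for the vector code.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of one lane of randomFloats (hypothetical helper).
float randomFloatScalar( std::uint32_t randomBits )
{
    // Keep the 23 low bits as the mantissa, then set the sign and exponent
    // bits of 1.0f (0x3F800000), which yields a float in [1, 2).
    const std::uint32_t bits = ( randomBits & 0x007FFFFFu ) | 0x3F800000u;
    float f;
    std::memcpy( &f, &bits, sizeof( f ) );  // reinterpret the bits, no conversion
    // Shift the range from [1, 2) down to [0, 1).
    return f - 1.0f;
}
```

Every output is a multiple of 2^-23, so the largest possible value is 1 - 2^-23 and positive floats smaller than 2^-23 are never produced. This is the sense in which the implementation skips representable floats near 0.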

Update: if you want better precision, use the following version. But it is no longer "the fastest".

__m256 __vectorcall randomFloats_32( __m256i randomBits )
{
    // Convert to random float bits
    __m256 result = _mm256_castsi256_ps( randomBits );
    // Zero out exponent bits, leave random bits in mantissa.
    const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
    result = _mm256_and_ps( result, mantissaMask );
    // Set sign + exponent bits to that of 1.0, which is sign=0, exponent = 2^0.
    const __m256 one = _mm256_set1_ps( 1.0f );
    result = _mm256_or_ps( result, one );
    // Subtract 1.0. The above algorithm generates floats in range [1..2).
    result = _mm256_sub_ps( result, one );

    // Use 9 unused random bits to add extra randomness to the lower bits of the values.
    // This increases precision to 2^-32, however most floats in the range can't store that many bits, fmadd will only add them for small enough values.

    // If you want uniformly distributed floats with 2^-24 precision, replace the second argument in the following line with _mm256_set1_epi32( 0x80000000 ).
    // In this case you don't need to set rounding mode bits in MXCSR.
    __m256i extraBits = _mm256_and_si256( randomBits, _mm256_set1_epi32( static_cast<int>( 0xFF800000u ) ) ); // top 9 bits, not used by the mantissa
    extraBits = _mm256_srli_epi32( extraBits, 9 );
    __m256 extra = _mm256_castsi256_ps( extraBits );
    extra = _mm256_or_ps( extra, one );
    extra = _mm256_sub_ps( extra, one );
    _MM_SET_ROUNDING_MODE( _MM_ROUND_DOWN );
    constexpr float mul = 0x1p-23f; // The initial part of the algorithm has generated uniform distribution with the step 2^-23.
    return _mm256_fmadd_ps( extra, _mm256_set1_ps( mul ), result );
}
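The 2^-24 variant mentioned in the code comments (masking with 0x80000000 instead of using all nine spare bits) can be modelled per lane as follows. This is a scalar sketch under that assumption; the names bitsToFloat and randomFloatScalar24 are hypothetical, and std::fma stands in for _mm256_fmadd_ps.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret 32 bits as a float (scalar stand-in for _mm256_castsi256_ps).
static float bitsToFloat( std::uint32_t bits )
{
    float f;
    std::memcpy( &f, &bits, sizeof( f ) );
    return f;
}

// Scalar model of one lane of the 2^-24 variant (hypothetical helper).
float randomFloatScalar24( std::uint32_t randomBits )
{
    const std::uint32_t oneBits = 0x3F800000u;  // bit pattern of 1.0f
    // First part: uniform values in [0, 1) with step 2^-23.
    const float result = bitsToFloat( ( randomBits & 0x007FFFFFu ) | oneBits ) - 1.0f;
    // Move the top random bit into the highest mantissa bit: extra is 0 or 0.5.
    const std::uint32_t extraBits = ( randomBits & 0x80000000u ) >> 9;
    const float extra = bitsToFloat( extraBits | oneBits ) - 1.0f;
    // Scale by 2^-23, so the addend is either 0 or 2^-24.
    return std::fma( extra, 1.0f / 8388608.0f, result );
}
```

The sum is always of the form n * 2^-24 with n < 2^24, which fits in a float's 24-bit significand. The addition is therefore exact, consistent with the comment that this variant does not need the rounding mode changed.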
