Efficient implementation of log2(__m256d) in AVX2


Problem description


SVML's __m256d _mm256_log2_pd (__m256d a) is not available on compilers other than Intel's, and they say its performance is handicapped on AMD processors. There are some implementations on the internet, referenced in AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? and SIMD math libraries for SSE and AVX, however they seem to be more SSE than AVX2. There's also Agner Fog's vector library, however it's a large library with much more than just vector log2, so it's hard to figure out from its implementation the essential parts for just the vector log2 operation.

So can someone explain how to implement the log2() operation for a vector of 4 double numbers efficiently? I.e. like what __m256d _mm256_log2_pd (__m256d a) does, but available for other compilers and reasonably efficient on both AMD and Intel processors.

EDIT: In my current specific case, the numbers are probabilities between 0 and 1, and the logarithm is used for entropy computation: the negation of the sum over all i of P[i]*log(P[i]). The range of floating-point exponents for P[i] is large, so the numbers can be close to 0. I'm not sure about accuracy, so I would consider any solution starting at 30 bits of mantissa; a tunable solution is especially preferred.
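
For reference, here is how that entropy sum vectorizes, as a sketch (assuming a Log2() like the one in EDIT2 below; n divisible by 4 and all p[i] > 0 are my simplifying assumptions):

double Entropy(const double* p, size_t n) {
  __m256d sums = _mm256_setzero_pd();
  for (size_t i = 0; i < n; i += 4) {
    const __m256d v = _mm256_loadu_pd(p + i);
    sums = _mm256_fmadd_pd(v, Log2(v), sums); // accumulate P[i]*log2(P[i])
  }
  double lanes[4];
  _mm256_storeu_pd(lanes, sums);
  return -(lanes[0] + lanes[1] + lanes[2] + lanes[3]); // negated sum = entropy in bits
}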

EDIT2: here is my implementation so far, based on "More efficient series" from https://en.wikipedia.org/wiki/Logarithm#Power_series . How can it be improved? (both performance and accuracy improvements are desired)

namespace {
  const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);
  const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);
  const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
  const __m128i gExpNormalizer = _mm_set1_epi32(1023);
  //TODO: some 128-bit variable or two 64-bit variables here?
  const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gVect1 = _mm256_set1_pd(1.0);
}

__m256d __vectorcall Log2(__m256d x) {
  const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
  const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
  const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
  const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
  const __m256d expsPD = _mm256_cvtepi32_pd(normExps);
  const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
    _mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));

  // Calculate t=(y-1)/(y+1) and t**2
  const __m256d tNum = _mm256_sub_pd(y, gVect1);
  const __m256d tDen = _mm256_add_pd(y, gVect1);
  const __m256d t = _mm256_div_pd(tNum, tDen);
  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);

  const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
  const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);

  return log2_x;
}

So far my implementation gives 405 268 490 logarithms per second, and it seems precise to the 8th digit. The performance is measured with the following function:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <immintrin.h>

// ... Log2() implementation here

const int64_t cnLogs = 100 * 1000 * 1000;

void BenchmarkLog2Vect() {
  __m256d sums = _mm256_setzero_pd();
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i += 4) {
    const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));
    const __m256d logs = Log2(x);
    sums = _mm256_add_pd(sums, logs);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  double sum = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];
  printf("Vect Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

Compared to the results of Logarithm in C++ and assembly, the current vector implementation is 4 times faster than std::log2() and 2.5 times faster than std::log().

Specifically, the following approximation formula is used (the "more efficient series" from the Wikipedia page linked above):

ln(z) = 2*artanh((z-1)/(z+1)) = 2*( (z-1)/(z+1) + (1/3)*((z-1)/(z+1))^3 + (1/5)*((z-1)/(z+1))^5 + ... )

Solution

Finally here is my best result, which on Ryzen 1800X @3.6GHz gives about 0.8 billion logarithms per second (200 million vectors of 4 logarithms each) in a single thread, and is accurate to within a few of the last bits of the mantissa. Spoiler: see at the end how to increase the performance to 0.87 billion logarithms per second.

Special cases: negative numbers, negative infinity and NaNs with a negative sign bit are treated as if they are very close to 0 (resulting in some garbage large-magnitude negative "logarithm" values). Positive infinity and NaNs with a positive sign bit result in a logarithm around 1024. If you don't like how the special cases are treated, one option is to add code that checks for them and does whatever suits you better. This will make the computation slower.
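
For example, a minimal sketch of such a check (Log2Checked is a hypothetical wrapper of mine around the Log2TblPlus() defined below; mapping every non-positive or NaN input to -infinity is just one possible policy):

#include <limits>

__m256d __vectorcall Log2Checked(__m256d x) {
  const __m256d raw = Log2TblPlus(x);
  // All-ones mask in lanes where x > 0; NaN inputs compare false and fall through to -inf
  const __m256d gtZero = _mm256_cmp_pd(x, _mm256_setzero_pd(), _CMP_GT_OQ);
  const __m256d minusInf = _mm256_set1_pd(-std::numeric_limits<double>::infinity());
  return _mm256_blendv_pd(minusInf, raw, gtZero);
}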

#include <cstdint>
#include <cmath>
#include <cstring>
#include <immintrin.h>

namespace {
  // The limit is 19 because we process only high 32 bits of doubles, and out of
  //   20 bits of mantissa there, 1 bit is used for rounding.
  constexpr uint8_t cnLog2TblBits = 10; // 1024 numbers times 8 bytes = 8KB.
  constexpr uint16_t cZeroExp = 1023;
  const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));
  const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));
  const __m256i cAvxExp2YMask = _mm256_set1_epi64x(
    ~((1ULL << (52-cnLog2TblBits)) - 1) );
  const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(
    1ULL << (52 - cnLog2TblBits - 1)));
  const __m256d gCommMul1 = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
  const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);
  const __m128i gExpNorm0 = _mm_set1_epi32(1023);
  // plus |cnLog2TblBits|th highest mantissa bit
  double gPlusLog2Table[1 << cnLog2TblBits];
} // anonymous namespace

void InitLog2Table() {
  for(uint32_t i=0; i<(1<<cnLog2TblBits); i++) {
    const uint64_t iZp = (uint64_t(cZeroExp) << 52)
      | (uint64_t(i) << (52 - cnLog2TblBits)) | (1ULL << (52 - cnLog2TblBits - 1));
    double zp;
    std::memcpy(&zp, &iZp, sizeof zp); // reinterpret the bits as a double (type-punning via reinterpret_cast is UB)
    const double l2zp = std::log2(zp);
    gPlusLog2Table[i] = l2zp;
  }
}

__m256d __vectorcall Log2TblPlus(__m256d x) {
  const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);
  const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);

  const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
    _mm256_castpd_si256(x), gHigh32Permute));
  // This requires that x is non-negative, because the sign bit is not cleared before
  //   computing the exponent.
  const __m128i exps32 = _mm_srai_epi32(high32, 20);
  const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);

  // Compute y as approximately equal to log2(z)
  const __m128i indexes = _mm_and_si128(cSseMantTblMask,
    _mm_srai_epi32(high32, 20 - cnLog2TblBits));
  const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
    /*number of bytes per item*/ 8);
  // Compute A as z/exp2(y)
  const __m256d exp2_Y = _mm256_or_pd(
    cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));

  // Calculate t=(A-1)/(A+1). Both numerator and denominator would be divided by exp2_Y
  const __m256d tNum = _mm256_sub_pd(z, exp2_Y);
  const __m256d tDen = _mm256_add_pd(z, exp2_Y);

  // Compute the first polynomial term from "More efficient series" of https://en.wikipedia.org/wiki/Logarithm#Power_series
  const __m256d t = _mm256_div_pd(tNum, tDen);

  const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);

  // Leading integer part for the logarithm
  const __m256d leading = _mm256_cvtepi32_pd(normExps);

  const __m256d log2_x = _mm256_add_pd(log2_z, leading);
  return log2_x;
}
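
Note that InitLog2Table() must be called once before the first call to Log2TblPlus(). A minimal usage sketch (the main() and the test values are mine):

#include <cstdio>

int main() {
  InitLog2Table(); // fill the 8KB lookup table once at startup
  const __m256d x = _mm256_set_pd(8.0, 4.0, 2.0, 1.0);
  const __m256d lg = Log2TblPlus(x);
  double out[4];
  _mm256_storeu_pd(out, lg);
  printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]); // expect 0 1 2 3
  return 0;
}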

It uses a combination of a lookup table approach and a 1st degree polynomial, mostly described on Wikipedia (the link is in the code comments). I can afford to allocate 8KB of L1 cache here (which is half of the 16KB L1 cache available per logical core), because logarithm computation is really the bottleneck for me and there is not much else that needs L1 cache.
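
In outline, the math the function implements is (my notation, matching the variable names in the code above): x is split as x = 2^e * z with z in [1, 2), and then

  log2(x)    = e + log2(z)                       -- e is normExps
  log2(z)    = y + log2(z/zp)                    -- zp = exp2_Y is the tabulated point nearest z,
                                                    y = gPlusLog2Table[index] = log2(zp)
  log2(z/zp) = (2/ln 2)*artanh(t) ≈ (2/ln 2)*t   -- t = (z - zp)/(z + zp), gCommMul1 = 2/ln 2

The last line is the 1st degree polynomial; the extra terms listed further below extend the same artanh series.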

However, if you need more L1 cache for other needs, you can decrease the amount of cache used by the logarithm algorithm by reducing cnLog2TblBits to e.g. 5, at the expense of decreasing the accuracy of the logarithm computation.

Or to keep the accuracy high, you can increase the number of polynomial terms by adding:

namespace {
  // ...
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gCoeff5 = _mm256_set1_pd(1.0 / 11);
}

And then changing the tail of Log2TblPlus() after the line const __m256d t = _mm256_div_pd(tNum, tDen); to:

  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
  const __m256d t11 = _mm256_mul_pd(t9, t2); // t**11
  const __m256d terms012345 = _mm256_fmadd_pd(gCoeff5, t11, terms01234);

  const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);

Then the comment // Leading integer part for the logarithm and the rest of the function follow unchanged.

Normally you don't need that many terms, even for a table with few bits; I just provided the coefficients and computations for reference. It's likely that if cnLog2TblBits==5, you won't need anything beyond terms012. But I haven't done such measurements; you need to experiment to see what suits your needs.

The fewer polynomial terms you compute, obviously, the faster the computation is.


EDIT: the question "In what situation would the AVX2 gather instructions be faster than individually loading the data?" suggests that you may get a performance improvement if

const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
  /*number of bytes per item*/ 8);

is replaced by

const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],
  gPlusLog2Table[indexes.m128i_u32[2]],
  gPlusLog2Table[indexes.m128i_u32[1]],
  gPlusLog2Table[indexes.m128i_u32[0]]);
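
(Note that .m128i_u32 is an MSVC-specific member of __m128i; on GCC/Clang a portable equivalent, as a sketch, extracts the lanes with _mm_extract_epi32:)

const __m256d y = _mm256_set_pd(
  gPlusLog2Table[_mm_extract_epi32(indexes, 3)],
  gPlusLog2Table[_mm_extract_epi32(indexes, 2)],
  gPlusLog2Table[_mm_extract_epi32(indexes, 1)],
  gPlusLog2Table[_mm_extract_epi32(indexes, 0)]);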

For my implementation it saves about 1.5 cycles, reducing the total cycle count to compute 4 logarithms from 18 to 16.5, so the performance rises to 0.87 billion logarithms per second. I'm leaving the current implementation as is because it's more idiomatic and should become faster once CPUs start doing gather operations right (with coalescing like GPUs do).

EDIT2: on a Ryzen CPU (but not on Intel) you can get a little more speedup (about 0.5 cycle), presumably because cross-lane 256-bit shuffles like vpermd are relatively expensive on Zen while 128-bit shuffles are cheap, by replacing

const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
  _mm256_castpd_si256(x), gHigh32Permute));

with

  const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
  const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));
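  // Take the odd 32-bit elements (the high halves of each double) from both 128-bit lanes: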
  const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,
    _MM_SHUFFLE(3, 1, 3, 1)));
