Improving performance of floating-point dot-product of an array with SIMD


Question

I have this function to compute the dot product over a 64-element block of two arrays of doubles:

void avx2_mul_64_block(double& sum, double* lhs_arr, double* rhs_arr) noexcept
{
    __m256i accumulator = _mm256_setzero_pd();

    for (std::size_t block = 0; block < 64; block += 4)
    {
        __m256i lhs = _mm256_set_pd(
            lhs_arr[block    ],
            lhs_arr[block + 1],
            lhs_arr[block + 2],
            lhs_arr[block + 3]);

        __m256i rhs = _mm256_set_pd(
            rhs_arr[block    ],
            rhs_arr[block + 1],
            rhs_arr[block + 2],
            rhs_arr[block + 3]);

        accumulator = _mm256_add_pd(accumulator, _mm256_mul_pd(lhs, rhs));
    }
    double* res = reinterpret_cast<double*>(&accumulator);
    sum += res[0] + res[1] + res[2] + res[3];
}

, and the performance of this code is not what I wanted. I believe the way to improve it is to avoid creating a double array just to sum up its elements, but I don't know a way to do that.

By the way, _mm256_setzero_pd slows the whole function down by half compared to _mm256_setzero_si256.

My compiler flags: -O3 -ftree-vectorize -march=native

P.S. It's not a real problem, just a design question.

Answer

Some of the suggestions were already mentioned in the comments, but I will try to suggest something in addition to them.

Most SIMD floating-point instructions on Haswell and newer CPUs have a reciprocal throughput smaller than their latency, which means performance can be improved if multiple independent instructions execute in parallel. For example, according to Agner Fog's instruction tables, vaddpd has a latency of 3 and a reciprocal throughput of 1 clock cycle on Haswell, meaning the CPU can potentially execute 3 such instructions in parallel. Even more vmulpd instructions can be in flight at once, given its latency of 5 and reciprocal throughput of 0.5 clock cycles.
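The effect of breaking a dependency chain can be seen even without intrinsics. The sketch below (plain scalar C++, purely illustrative, not the answer's code) sums an array once with a single accumulator and once with four independent ones; with these inputs both produce the same result, but the second version gives the CPU four independent addition chains to overlap:

```cpp
#include <cassert>
#include <cstddef>

// Single dependency chain: every addition waits for the previous one,
// so throughput is limited by the latency of one add.
double sum_single(const double* a, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Four independent chains: the CPU can overlap the additions.
// Note: this reassociates the FP additions, so results may differ
// slightly for general inputs (the same caveat as the SIMD version).
// For simplicity this sketch assumes n is a multiple of 4.
double sum_unrolled(const double* a, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (std::size_t i = 0; i < n; i += 4)
    {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```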

Your code likely does not leverage this instruction-level parallelism (ILP) because the loop body depends on the accumulator value that was updated in the previous loop iteration. This is because the compiler is not allowed to perform many optimizations, such as reordering math operations on FP numbers, since that could produce mathematically different results. As a result, your algorithm becomes latency-bound.

You could mitigate that by using compiler-specific options, like -ffast-math for gcc and compatible compilers, but it is more portable to just rewrite your algorithm with this parallelism in mind.

I will also incorporate other recommendations in the code below, such as:

  • Fix the incorrect vector type, which should be __m256d.
  • Use dedicated load instructions to load whole vectors from memory instead of relying on the compiler to optimize _mm256_set_pd.
  • Use FMA intrinsics instead of relying on the compiler to fuse the _mm256_add_pd + _mm256_mul_pd pair. The FMA instruction reduces the latency of the calculation, making the addition effectively free. FMA also produces more precise results, because there is no rounding between the multiplication and the addition. Note that FMA is a separate CPU feature that in practice ships together with AVX2; it is not available on CPUs supporting only AVX.
  • Use proper intrinsics to extract the final sum from the vector (these will probably be optimized away in the final assembly, since the double is stored in a vector register anyway).
#include <immintrin.h>
#include <cstddef>

void avx2_mul_64_block(double& sum, double* lhs_arr, double* rhs_arr) noexcept
{
    // Two independent accumulators let the CPU overlap the FMA chains.
    __m256d accumulator1 = _mm256_setzero_pd();
    __m256d accumulator2 = _mm256_setzero_pd();

    for (std::size_t block = 0; block < 64; block += 4 * 2)
    {
        __m256d lhs1 = _mm256_loadu_pd(lhs_arr + block);
        __m256d lhs2 = _mm256_loadu_pd(lhs_arr + block + 4);
        __m256d rhs1 = _mm256_loadu_pd(rhs_arr + block);
        __m256d rhs2 = _mm256_loadu_pd(rhs_arr + block + 4);

        accumulator1 = _mm256_fmadd_pd(lhs1, rhs1, accumulator1);
        accumulator2 = _mm256_fmadd_pd(lhs2, rhs2, accumulator2);
    }

    // Combine the two accumulators, then reduce the 4 lanes to a scalar:
    // low 128 bits + high 128 bits, then low lane + high lane.
    accumulator1 = _mm256_add_pd(accumulator1, accumulator2);

    __m128d accumulator = _mm_add_pd(_mm256_castpd256_pd128(accumulator1),
        _mm256_extractf128_pd(accumulator1, 1));
    accumulator = _mm_add_pd(accumulator,
        _mm_unpackhi_pd(accumulator, accumulator));

    sum += _mm_cvtsd_f64(accumulator);
}
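A quick way to sanity-check the vectorized routine is to compare it against a plain scalar reference. The helper name below is hypothetical, not from the original answer:

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference for the same 64-element dot product. With inputs
// whose products and partial sums are exactly representable, it must
// match the SIMD version bit-for-bit; for general inputs compare with
// a tolerance, since the summation orders differ.
double dot_64_scalar(const double* lhs, const double* rhs)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < 64; ++i)
        sum += lhs[i] * rhs[i];
    return sum;
}
```

Compile the SIMD version with -mavx2 -mfma (or -march=native on a capable machine) and assert that both functions agree on your test vectors.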

In the code above, I used two separate accumulators, so the CPU is now able to execute two chains of accumulation in parallel. Further increasing the parallelism could be beneficial (see the performance numbers mentioned above), but it becomes more problematic if the block length is not divisible by the number of accumulators multiplied by the number of elements in the vector. You may need to set up tail processing, and that may add some slight performance overhead.
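The main-loop-plus-tail structure could look like the following scalar sketch (an assumption for illustration, not the answer's code; the 8-element block stands in for the two 4-wide FMA accumulators above, and a real implementation would use the intrinsics there):

```cpp
#include <cassert>
#include <cstddef>

// Structural sketch of main loop + tail handling for arbitrary n.
double dot_any(const double* lhs, const double* rhs, std::size_t n)
{
    double sum = 0.0;
    std::size_t i = 0;

    // Main loop: whole blocks of 8 elements (2 accumulators x 4 lanes).
    for (; i + 8 <= n; i += 8)
        for (std::size_t j = 0; j < 8; ++j)
            sum += lhs[i + j] * rhs[i + j];

    // Tail: the remaining 0..7 elements, handled with scalar code.
    for (; i < n; ++i)
        sum += lhs[i] * rhs[i];

    return sum;
}
```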

Note that, as mentioned before, this algorithm may produce results that are not strictly equal to the original, because of the different order of arithmetic operations and the use of FMA, and hence a different accumulation of rounding error. However, this is often not a problem, especially given the high precision of double.

Some suggestions that were not used in the code above:

  • Horizontal addition (_mm256_hadd_pd) was not used to accumulate the final sum, because on current Intel (up to Coffee Lake) and AMD (up to Zen 2) processors the vhaddpd instruction has slightly more latency than the vunpckhpd + vaddpd pair, even though the latter forms a dependency chain. This may change in future processors, where horizontal addition may become beneficial. It can still be useful on current CPUs to save some code size, though.
  • The code uses unaligned loads from memory, as opposed to _mm256_load_pd, because unaligned loads on modern CPUs that support AVX2 carry no performance penalty, as long as the loads are actually aligned at run time or at least don't cross a cache-line boundary. There is overhead when a cache line, and especially a page boundary, is crossed, but usually even that does not degrade performance much on modern CPUs (there is some performance data in a post by Peter Cordes, as well as in the materials linked there).
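If you control the allocation, you can guarantee the no-penalty case by over-aligning the buffers yourself. A minimal sketch (the wrapper type is an illustrative assumption):

```cpp
#include <cassert>
#include <cstdint>

// Aligning the storage to 32 bytes guarantees that the "unaligned"
// _mm256_loadu_pd loads above are in fact aligned at run time and
// never straddle a cache-line boundary.
struct alignas(32) Block64
{
    double data[64];
};
```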
