AVX2: Computing dot product of 512 float arrays


Question

I will preface this by saying that I am a complete beginner at SIMD intrinsics.

Essentially, I have a CPU which supports the AVX2 instruction set (Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz). I would like to know the fastest way to compute the dot product of two std::vector<float> of size 512.

I have done some digging online and found this and this, and this Stack Overflow question suggests using the following function: __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);. However, these all suggest different ways of performing the dot product, and I am not sure which is the correct (and fastest) way to do it.

In particular, I am looking for the fastest way to perform the dot product for a vector of size 512 (because I know the vector size affects the implementation).

Thanks for your help.

Edit 1: I am also a little confused about the -mavx2 gcc flag. If I use these AVX2 functions, do I need to add the flag when I compile? Also, is gcc able to do these optimizations for me (say, if I use the -Ofast gcc flag) if I write a naive dot product implementation?

Edit 2: If anyone has the time and energy, I would very much appreciate it if you could write a full implementation. I am sure other beginners would also value this information.

Answer

_mm256_dp_ps is only useful for dot products of 2 to 4 elements; for longer vectors, use vertical SIMD in a loop and reduce to scalar at the end. Using _mm256_dp_ps and _mm256_add_ps in a loop would be much slower.

GCC and clang require you to enable (with command line options) ISA extensions that you use intrinsics for, unlike MSVC and ICC.

The code below is probably close to the theoretical performance limit of your CPU. Untested.

Compile it with clang or gcc -O3 -march=native. (Requires at least -mavx -mfma, but the -mtune options implied by -march are good too, and so are -mpopcnt and the other things -march=native enables. Tune options are critical for this to compile efficiently for most CPUs with FMA, specifically -mno-avx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)

Or with MSVC: -O2 -arch:AVX2

#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}

// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into noop, modern compilers return floats in the lowest lane of xmm0 register.
    return _mm_cvtss_f32( r1 );
}
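
For completeness, here is a minimal usage sketch (untested, like the function above; the fill values and the printf are just for illustration):

#include <cstdio>

int main()
{
    // 512 elements satisfies the size % 32 == 0 requirement asserted above.
    std::vector<float> a( 512, 1.5f );
    std::vector<float> b( 512, 2.0f );
    const float dot = dotProductFma( a, b );
    std::printf( "%f\n", dot );  // expect 512 * 1.5 * 2.0 = 1536
    return 0;
}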

Possible further improvements:

  1. Unroll by 8 vectors instead of 4. I've checked gcc 9.2 asm output; the compiler only used 8 of the 16 available vector registers.

  2. Make sure both input vectors are aligned, e.g. use a custom allocator which calls _aligned_malloc / _aligned_free on MSVC, or aligned_alloc / free on gcc & clang. Then replace _mm256_loadu_ps with _mm256_load_ps. (A sketch of such an allocator follows below.)
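
For example, a minimal C++17 sketch of such an allocator (the name AlignedAllocator is just illustrative, and it uses aligned operator new instead of the platform-specific calls mentioned above):

#include <cstddef>
#include <new>

// Allocates storage aligned to 32 bytes, so _mm256_load_ps is safe on the vector's data.
template<typename T>
struct AlignedAllocator
{
    using value_type = T;
    AlignedAllocator() = default;
    template<typename U> AlignedAllocator( const AlignedAllocator<U>& ) noexcept { }

    T* allocate( std::size_t n )
    {
        return static_cast<T*>( ::operator new( n * sizeof( T ), std::align_val_t{ 32 } ) );
    }
    void deallocate( T* p, std::size_t ) noexcept
    {
        ::operator delete( p, std::align_val_t{ 32 } );
    }
};

template<typename T, typename U>
bool operator==( const AlignedAllocator<T>&, const AlignedAllocator<U>& ) noexcept { return true; }
template<typename T, typename U>
bool operator!=( const AlignedAllocator<T>&, const AlignedAllocator<U>& ) noexcept { return false; }

// Usage: std::vector<float, AlignedAllocator<float>> a( 512 );
// Note that dotProductFma above would then need to accept this vector type instead of std::vector<float>.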


To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or -ffast-math (implied by -Ofast) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.
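
As a sketch of what that naive version looks like (assuming the same includes as above; the #pragma omp simd line needs -fopenmp-simd or -fopenmp, and this is for comparison rather than a replacement for the intrinsics version):

// Naive dot product; the reduction clause lets the compiler vectorize the sum
// without -ffast-math, but GCC still won't use multiple accumulators here.
float dotProductScalar( const std::vector<float>& a, const std::vector<float>& b )
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for( std::size_t i = 0; i < a.size(); i++ )
        sum += a[i] * b[i];
    return sum;
}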

(2 loads per FMA means the throughput bottleneck for this code is vector loads, not the actual FMA operations.)
