Horizontal sum of 32-bit floats in 256-bit AVX vector


Question

I have two arrays of floats and I would like to calculate their dot product, using SSE and AVX, with the lowest possible latency. I am aware there is a 256-bit dot-product intrinsic for floats, but I have read on SO that it is slower than the technique below (https://stackoverflow.com/a/4121295/997112).

I have done most of the work: the vector temp_sum contains all the partial sums. I just need to sum the eight 32-bit values contained in temp_sum at the end.

#include <xmmintrin.h>
#include <immintrin.h>

int main(){
    const int num_elements_in_array = 16;
    __declspec(align(32)) float x[num_elements_in_array];
    __declspec(align(32)) float y[num_elements_in_array];

    x[0] = 2;   x[1] = 2;   x[2] = 2;   x[3] = 2;
    x[4] = 2;   x[5] = 2;   x[6] = 2;   x[7] = 2;
    x[8] = 2;   x[9] = 2;   x[10] = 2;  x[11] = 2;
    x[12] = 2;  x[13] = 2;  x[14] = 2;  x[15] = 2;

    y[0] = 3;   y[1] = 3;   y[2] = 3;   y[3] = 3;
    y[4] = 3;   y[5] = 3;   y[6] = 3;   y[7] = 3;
    y[8] = 3;   y[9] = 3;   y[10] = 3;  y[11] = 3;
    y[12] = 3;  y[13] = 3;  y[14] = 3;  y[15] = 3;

    __m256 a;
    __m256 b;
    __m256 temp_products;   
    __m256 temp_sum = _mm256_setzero_ps();

    unsigned short j = 0;
    const int sse_data_size = 32;
    int num_values_to_process = sse_data_size/sizeof(float);

    while(j < num_elements_in_array){
        a = _mm256_load_ps(x+j);
        b = _mm256_load_ps(y+j);

        temp_products = _mm256_mul_ps(b, a);
        temp_sum = _mm256_add_ps(temp_sum, temp_products);

        j = j + num_values_to_process;
    }

    //Need to "process" temp_sum as a final value here

}

I am worried that the 256-bit intrinsics I need are not available with AVX 1 alone.

Recommended answer

I would suggest using 128-bit AVX instructions whenever possible. This saves the latency of one cross-domain shuffle (2 cycles on Intel Sandy/Ivy Bridge) and improves efficiency on CPUs that run AVX instructions on 128-bit execution units (currently AMD Bulldozer, Piledriver, Steamroller, and Jaguar):

static inline float _mm256_reduce_add_ps(__m256 x) {
    /* ( x3+x7, x2+x6, x1+x5, x0+x4 ) */
    const __m128 x128 = _mm_add_ps(_mm256_extractf128_ps(x, 1), _mm256_castps256_ps128(x));
    /* ( -, -, x1+x3+x5+x7, x0+x2+x4+x6 ) */
    const __m128 x64 = _mm_add_ps(x128, _mm_movehl_ps(x128, x128));
    /* ( -, -, -, x0+x1+x2+x3+x4+x5+x6+x7 ) */
    const __m128 x32 = _mm_add_ss(x64, _mm_shuffle_ps(x64, x64, 0x55));
    /* Conversion to float is a no-op on x86-64 */
    return _mm_cvtss_f32(x32);
}
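
To show how this reduction could plug into the loop from the question, here is a minimal sketch in C. The names hsum256_ps, dot_product_avx, and AVX_TARGET are hypothetical and not part of the original post. The sketch assumes the element count is a multiple of 8, and it uses unaligned loads so the caller does not have to guarantee 32-byte alignment. Every intrinsic it calls is SSE or AVX 1, so AVX2 is not required.

```c
#include <stddef.h>
#include <immintrin.h>

/* Assumption: on GCC/Clang this attribute enables AVX for the functions
   below without a global -mavx flag. Other compilers get an empty macro. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Horizontal sum of the eight floats in v, using the same
   256 -> 128 -> 64 -> 32 bit narrowing steps as the answer. */
AVX_TARGET static float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);      /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);    /* lanes 4..7 */
    __m128 s4 = _mm_add_ps(lo, hi);             /* four partial sums */
    __m128 s2 = _mm_add_ps(s4, _mm_movehl_ps(s4, s4));        /* two partial sums */
    __m128 s1 = _mm_add_ss(s2, _mm_shuffle_ps(s2, s2, 0x55)); /* total in lane 0 */
    return _mm_cvtss_f32(s1);
}

/* Hypothetical helper: dot product of x and y.
   Assumes n is a multiple of 8; there is no scalar tail loop. */
AVX_TARGET static float dot_product_avx(const float *x, const float *y, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 a = _mm256_loadu_ps(x + i);
        __m256 b = _mm256_loadu_ps(y + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(a, b));
    }
    return hsum256_ps(acc);
}
```

With the question's data (sixteen elements of 2 in x and sixteen elements of 3 in y), dot_product_avx(x, y, 16) returns 96. If the arrays are known to be 32-byte aligned, as in the question, _mm256_load_ps can replace _mm256_loadu_ps.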
