Horizontal add with __m512 (AVX512)

Question

How does one efficiently perform a horizontal add on the floats in a 512-bit AVX register (i.e., add together the items of a single vector)? For 128- and 256-bit registers this can be done with _mm_hadd_ps and _mm256_hadd_ps, but there is no _mm512_hadd_ps. The Intel Intrinsics Guide documents _mm512_reduce_add_ps. It doesn't correspond to a single instruction, but its existence suggests there is an optimal method. However, it doesn't appear to be defined in the header files that come with the latest GCC snapshot, and I can't find a definition for it with Google.

I figure "hadd" can be emulated with _mm512_shuffle_ps and _mm512_add_ps, or I could use _mm512_extractf32x4_ps to break a 512-bit register into four 128-bit registers, but I want to make sure I'm not missing something better.
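The second idea in the question (splitting into four 128-bit registers) might look roughly like the sketch below. The names hsum_four_lanes and sum16_four_lanes are hypothetical, the target attribute and __builtin_cpu_supports are GCC/Clang extensions assumed here so the code builds without -mavx512f, and the scalar fallback exists only so the example runs on machines without AVX-512.

```cpp
#include <immintrin.h>

// Hypothetical helper (name not from the question): sum 16 floats by
// splitting the 512-bit register into four 128-bit quarters.
__attribute__((target("avx512f")))
static float hsum_four_lanes(__m512 v) {
    __m128 q0 = _mm512_castps512_ps128(v);      // elements 0..3 (cast only)
    __m128 q1 = _mm512_extractf32x4_ps(v, 1);   // elements 4..7
    __m128 q2 = _mm512_extractf32x4_ps(v, 2);   // elements 8..11
    __m128 q3 = _mm512_extractf32x4_ps(v, 3);   // elements 12..15
    // Vertical adds reduce four quarters to one vector [s0, s1, s2, s3].
    __m128 s = _mm_add_ps(_mm_add_ps(q0, q1), _mm_add_ps(q2, q3));
    // Fold the upper pair onto the lower pair: [s0+s2, s1+s3, ...].
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    // Add element 1 to element 0 to finish the reduction.
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    return _mm_cvtss_f32(s);
}

__attribute__((target("avx512f")))
static float sum16_four_lanes_avx512(const float* p) {
    return hsum_four_lanes(_mm512_loadu_ps(p));
}

// Runtime dispatch: falls back to a scalar loop when AVX-512F is absent.
float sum16_four_lanes(const float* p) {
    if (__builtin_cpu_supports("avx512f")) return sum16_four_lanes_avx512(p);
    float s = 0.0f;
    for (int i = 0; i < 16; ++i) s += p[i];
    return s;
}
```

Calling sum16_four_lanes on an array holding 1.0f through 16.0f should give 136.0f on either path, since every partial sum is a small integer and therefore exact in float.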

Recommended answer

The Intel compiler defines the following intrinsics for horizontal sums:

_mm512_reduce_add_ps     //horizontal sum of 16 floats
_mm512_reduce_add_pd     //horizontal sum of 8 doubles
_mm512_reduce_add_epi32  //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64  //horizontal sum of 8 64-bit integers
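For reference, calling the first of these directly might look like the sketch below. It assumes a compiler whose headers provide _mm512_reduce_add_ps (the Intel compiler, per the list above; recent GCC and Clang releases also ship it, unlike the GCC snapshot mentioned in the question). The function names are hypothetical, and the target attribute, __builtin_cpu_supports, and the scalar fallback are GCC/Clang-specific scaffolding so the example runs without AVX-512 hardware.

```cpp
#include <immintrin.h>

// Hypothetical wrapper: load 16 floats and reduce them with the
// compiler-provided sequence behind _mm512_reduce_add_ps.
__attribute__((target("avx512f")))
static float reduce16_avx512(const float* p) {
    return _mm512_reduce_add_ps(_mm512_loadu_ps(p));
}

// Runtime dispatch: falls back to a scalar loop when AVX-512F is absent.
float reduce16(const float* p) {
    if (__builtin_cpu_supports("avx512f")) return reduce16_avx512(p);
    float s = 0.0f;
    for (int i = 0; i < 16; ++i) s += p[i];
    return s;
}
```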

However, as far as I can tell these intrinsics are broken into multiple instructions anyway, so I don't think you gain anything over splitting the AVX512 register into its lower and upper halves and doing the horizontal sum on those:

// float: zmm is a __m512 (_mm512_extractf32x8_ps would need AVX512DQ, hence the pd casts)
__m256 low  = _mm512_castps512_ps256(zmm);
__m256 high = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm),1));

// double: zmm is a __m512d
__m256d low  = _mm512_castpd512_pd256(zmm);
__m256d high = _mm512_extractf64x4_pd(zmm,1);

// integer: zmm is a __m512i
__m256i low  = _mm512_castsi512_si256(zmm);
__m256i high = _mm512_extracti64x4_epi64(zmm,1);

To get the horizontal sum you then compute sum = horizontal_add(low + high). The + operator on raw __m256 values is a GCC/Clang extension; _mm256_add_ps(low, high) is the portable form.

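static inline float horizontal_add (__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);                        // pairwise sums within each 128-bit lane
    __m256 t2 = _mm256_hadd_ps(t1,t1);                      // each lane now holds its own 4-element sum
    __m128 t3 = _mm256_extractf128_ps(t2,1);                // upper lane
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);  // lower lane sum + upper lane sum
    return _mm_cvtss_f32(t4);
}

static inline double horizontal_add (__m256d a) {
    __m256d t1 = _mm256_hadd_pd(a,a);                       // pairwise sums within each 128-bit lane
    __m128d t2 = _mm256_extractf128_pd(t1,1);               // upper lane
    __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2); // lower lane sum + upper lane sum
    return _mm_cvtsd_f64(t3);
}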
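Putting the pieces above together for the float case gives roughly the following sketch. The names hsum512_ps and sum16_split are hypothetical, _mm256_add_ps replaces the low + high shorthand, and the target attribute, __builtin_cpu_supports, and the scalar fallback are GCC/Clang-specific assumptions added so the example builds without -mavx512f and runs on machines without AVX-512.

```cpp
#include <immintrin.h>

// Hypothetical helper: horizontal sum of the 16 floats in a __m512,
// following the split-into-halves approach described in the answer.
__attribute__((target("avx512f")))
static inline float hsum512_ps(__m512 zmm) {
    // Lower 256 bits: a cast, no data movement implied.
    __m256 low  = _mm512_castps512_ps256(zmm);
    // Upper 256 bits, extracted through the double-precision view.
    __m256 high = _mm256_castpd_ps(
        _mm512_extractf64x4_pd(_mm512_castps_pd(zmm), 1));
    __m256 a  = _mm256_add_ps(low, high);       // 16 values -> 8 partial sums
    __m256 t1 = _mm256_hadd_ps(a, a);           // pairwise sums per 128-bit lane
    __m256 t2 = _mm256_hadd_ps(t1, t1);         // per-lane totals
    __m128 t3 = _mm256_extractf128_ps(t2, 1);   // upper lane total
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2), t3);
    return _mm_cvtss_f32(t4);
}

__attribute__((target("avx512f")))
static float sum16_split_avx512(const float* p) {
    return hsum512_ps(_mm512_loadu_ps(p));
}

// Runtime dispatch: falls back to a scalar loop when AVX-512F is absent.
float sum16_split(const float* p) {
    if (__builtin_cpu_supports("avx512f")) return sum16_split_avx512(p);
    float s = 0.0f;
    for (int i = 0; i < 16; ++i) s += p[i];
    return s;
}
```

Note that the vector path adds the elements in a different order than a left-to-right scalar loop, so for general floating-point inputs the two results can differ in the last bits.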

I got all of this information from Agner Fog's Vector Class Library and the online Intel Intrinsics Guide.
