Summing 8-bit integers in __m512i with AVX intrinsics
Problem description
AVX-512 provides intrinsics that sum all elements of a __m512 vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 yet.
_mm512_reduce_add_ps //horizontal sum of 16 floats
_mm512_reduce_add_pd //horizontal sum of 8 doubles
_mm512_reduce_add_epi32 //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64 //horizontal sum of 8 64-bit integers
Basically, I need to implement MAGIC in the following snippet:
__m512i all_ones = _mm512_set1_epi16(1);
short sum_of_ones = MAGIC(all_ones);
/* now sum_of_ones contains 32, the sum of 32 ones. */
The most obvious way would be to use _mm512_storeu_epi8 to write the vector to an array and then sum the array's elements, but that would be slow, and it might also invalidate the cache. I suppose a faster approach exists.
Bonus points for implementing _mm512_reduce_add_epi16 as well.
First of all, _mm512_reduce_add_epi64 does not correspond to a single AVX-512 instruction; the compiler expands it into a sequence of shuffles and additions.
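To make the "shuffles and additions" concrete, here is a minimal sketch of one way such a reduction can be written by hand: halve the vector width three times, adding the upper half onto the lower half each time. The names hsum_epi64_manual and sum8_i64 are hypothetical, the code assumes x86-64, and the exact sequence a given compiler emits for _mm512_reduce_add_epi64 may differ. The scalar branch is only there so the sketch still builds when AVX-512 is not enabled.

```c
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>

/* Hypothetical helper: a hand-written horizontal sum of 8 x int64,
   similar in spirit to what _mm512_reduce_add_epi64 expands to. */
static inline int64_t hsum_epi64_manual(__m512i v)
{
    /* 512 -> 256 bits: add the upper half onto the lower half. */
    __m256i lo256 = _mm512_castsi512_si256(v);
    __m256i hi256 = _mm512_extracti64x4_epi64(v, 1);
    __m256i s256  = _mm256_add_epi64(lo256, hi256);

    /* 256 -> 128 bits. */
    __m128i lo128 = _mm256_castsi256_si128(s256);
    __m128i hi128 = _mm256_extracti128_si256(s256, 1);
    __m128i s128  = _mm_add_epi64(lo128, hi128);

    /* 128 -> 64 bits: add the high qword onto the low qword. */
    __m128i hi64 = _mm_unpackhi_epi64(s128, s128);
    return _mm_cvtsi128_si64(_mm_add_epi64(s128, hi64));
}
#endif

/* Sums exactly 8 int64 values starting at p (unaligned is fine). */
int64_t sum8_i64(const int64_t *p)
{
#if defined(__AVX512F__)
    return hsum_epi64_manual(_mm512_loadu_si512(p));
#else
    /* Scalar fallback for builds without AVX-512. */
    int64_t s = 0;
    for (int i = 0; i < 8; ++i)
        s += p[i];
    return s;
#endif
}
```

In real code there is little reason to prefer this over _mm512_reduce_add_epi64 itself; it only illustrates why the reduction costs several instructions.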
To reduce 64 epu8 values to 8 epi64 values, one usually uses the vpsadbw instruction (SAD = Sum of Absolute Differences) against a zero vector; the result can then be reduced further:
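#include <immintrin.h>

/* Sum of 64 unsigned bytes; vpsadbw against zero gives eight 64-bit partial sums (needs AVX512BW). */
long reduce_add_epu8(__m512i a)
{
    return _mm512_reduce_add_epi64(_mm512_sad_epu8(a, _mm512_setzero_si512()));
}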
Try it on godbolt: https://godbolt.org/z/1rMiPH. Unfortunately, neither GCC nor Clang seems able to optimize the function away (that is, fold it to a constant) when it is called with _mm512_set1_epi16(1).
For epi8 instead of epu8, you first need to add 128 to each element (or XOR it with 0x80), then reduce using vpsadbw, and finally subtract 64*128 (or subtract 8*128 from each intermediate 64-bit result). [Note: this was wrong in a previous version of this answer.]
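The bias-and-correct steps above can be sketched as follows. This is a minimal illustration, assuming AVX512BW is available; the names reduce_add_epi8_sketch and sum64_i8 are hypothetical, and the scalar branch exists only so the sketch builds without AVX-512 flags.

```c
#include <stdint.h>

#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>

/* Hypothetical helper: horizontal sum of 64 signed bytes. */
static inline int64_t reduce_add_epi8_sketch(__m512i a)
{
    /* Flipping the sign bit maps each signed byte x to the unsigned byte x + 128. */
    __m512i biased = _mm512_xor_si512(a, _mm512_set1_epi8((char)-128));
    /* vpsadbw against zero: eight 64-bit sums, each over 8 biased bytes. */
    __m512i sums = _mm512_sad_epu8(biased, _mm512_setzero_si512());
    /* Undo the bias that the 64 elements accumulated. */
    return _mm512_reduce_add_epi64(sums) - 64 * 128;
}
#endif

/* Sums exactly 64 signed bytes starting at p (unaligned is fine). */
int sum64_i8(const int8_t *p)
{
#if defined(__AVX512F__) && defined(__AVX512BW__)
    return (int)reduce_add_epi8_sketch(_mm512_loadu_si512(p));
#else
    /* Scalar fallback for builds without AVX512BW. */
    int s = 0;
    for (int i = 0; i < 64; ++i)
        s += p[i];
    return s;
#endif
}
```

The result always lies between 64 * -128 and 64 * 127, so it fits comfortably in an int.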
For epi16, I suggest looking at the instructions that _mm512_reduce_add_epi32 and _mm512_reduce_add_epi64 generate and deriving what to do from there.
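One possible shortcut for epi16, which is not spelled out in the answer, is to widen first: vpmaddwd (_mm512_madd_epi16) against a vector of ones adds adjacent pairs of signed 16-bit elements into 32-bit lanes, after which _mm512_reduce_add_epi32 finishes the job. The sketch below assumes AVX512BW; reduce_add_epi16_sketch and sum32_i16 are hypothetical names, and the scalar branch is only a fallback for builds without AVX-512.

```c
#include <stdint.h>

#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>

/* Hypothetical helper: horizontal sum of 32 signed 16-bit integers. */
static inline int32_t reduce_add_epi16_sketch(__m512i a)
{
    /* Multiply by 1 and add adjacent pairs: 32 x int16 -> 16 x int32. */
    __m512i pairs = _mm512_madd_epi16(a, _mm512_set1_epi16(1));
    /* Reduce the sixteen 32-bit partial sums. */
    return _mm512_reduce_add_epi32(pairs);
}
#endif

/* Sums exactly 32 int16 values starting at p (unaligned is fine). */
int32_t sum32_i16(const int16_t *p)
{
#if defined(__AVX512F__) && defined(__AVX512BW__)
    return reduce_add_epi16_sketch(_mm512_loadu_si512(p));
#else
    /* Scalar fallback for builds without AVX512BW. */
    int32_t s = 0;
    for (int i = 0; i < 32; ++i)
        s += p[i];
    return s;
#endif
}
```

The sum is returned as a 32-bit value because 32 elements can exceed the range of a short; a caller that wants the wrap-around behaviour of the MAGIC snippet in the question can truncate the result.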
Overall, as @Mysticial suggested, the best way to reduce depends on your context. For example, if you have a very large array of int64 and want the sum as an int64, you should add the elements together packet-wise (one vector at a time) and only at the very end reduce the single accumulated packet to one int64.
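A minimal sketch of that packet-wise pattern, under the assumption that the data is a plain int64_t array of arbitrary length: keep one vector accumulator, add eight elements per iteration, and perform the horizontal reduction once after the loop. The name sum_array_i64 is hypothetical; without AVX-512 enabled, the scalar tail loop simply handles the whole array.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper: sum of n int64 values (overflow is not checked). */
int64_t sum_array_i64(const int64_t *data, size_t n)
{
    int64_t total = 0;
    size_t i = 0;

#if defined(__AVX512F__)
    /* Vertical adds only inside the loop: one vpaddq per 8 elements. */
    __m512i acc = _mm512_setzero_si512();
    for (; i + 8 <= n; i += 8)
        acc = _mm512_add_epi64(acc, _mm512_loadu_si512(data + i));
    /* The expensive horizontal reduction happens exactly once. */
    total = _mm512_reduce_add_epi64(acc);
#endif

    /* Remaining elements (or everything, without AVX-512). */
    for (; i < n; ++i)
        total += data[i];
    return total;
}
```

The same idea carries over to bytes: accumulate the vpsadbw results with _mm512_add_epi64 inside the loop and reduce the accumulator at the end.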