Summing 8-bit integers in __m512i with AVX intrinsics


Problem description

AVX512 provides us with intrinsics to sum all elements of a __m512 vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 yet.

_mm512_reduce_add_ps     //horizontal sum of 16 floats
_mm512_reduce_add_pd     //horizontal sum of 8 doubles
_mm512_reduce_add_epi32  //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64  //horizontal sum of 8 64-bit integers

Basically, I need to implement MAGIC in the following snippet.

__m512i all_ones = _mm512_set1_epi16(1);
short sum_of_ones = MAGIC(all_ones);
/* now sum_of_ones contains 32, the sum of 32 ones. */

The most obvious way would be to use _mm512_storeu_epi8 and sum the elements of the resulting array, but that would be slow, and it might invalidate the cache. I suppose a faster approach exists.

Bonus points for implementing _mm512_reduce_add_epi16 as well.

Solution

First of all, _mm512_reduce_add_epi64 does not correspond to a single AVX512 instruction, but it generates a sequence of shuffles and additions.
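To make "a sequence of shuffles and additions" concrete, here is a minimal hand-written sketch of one such sequence: halve the vector width repeatedly and add the halves. It is not necessarily the exact code any compiler emits for _mm512_reduce_add_epi64. The names hsum_epi64_avx512 and hsum8_i64 are hypothetical, and the runtime check with a scalar fallback is only there so the example also runs on machines without AVX-512 (it assumes GCC or Clang on x86-64).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: horizontal sum of 8 int64 values, written out by hand. */
__attribute__((target("avx512f,avx2")))
static int64_t hsum_epi64_avx512(const int64_t *p)
{
    __m512i v = _mm512_loadu_si512((const void *)p);

    /* 512 -> 256 bits: add the upper four lanes to the lower four. */
    __m256i lo256 = _mm512_castsi512_si256(v);
    __m256i hi256 = _mm512_extracti64x4_epi64(v, 1);
    __m256i s256 = _mm256_add_epi64(lo256, hi256);

    /* 256 -> 128 bits: add the upper two lanes to the lower two. */
    __m128i s128 = _mm_add_epi64(_mm256_castsi256_si128(s256),
                                 _mm256_extracti128_si256(s256, 1));

    /* 128 -> 64 bits: copy the high lane down and add it to the low lane. */
    __m128i high = _mm_unpackhi_epi64(s128, s128);
    return _mm_cvtsi128_si64(_mm_add_epi64(s128, high));
}
#endif

/* Hypothetical wrapper: uses the AVX-512 path only when the CPU supports it. */
int64_t hsum8_i64(const int64_t *p)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f"))
        return hsum_epi64_avx512(p);
#endif
    int64_t s = 0;
    for (size_t i = 0; i < 8; ++i)
        s += p[i];
    return s;
}
```

Three add steps are needed because each step halves the number of live lanes (8, 4, 2, 1).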

To reduce 64 epu8 values to 8 epi64 values, one usually uses the vpsadbw instruction (SAD = Sum of Absolute Differences) against a zero vector. The 8 partial sums can then be reduced further:

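#include <immintrin.h>
/* vpsadbw against zero sums each group of 8 unsigned bytes into one 64-bit lane;
   the 8 lane sums are then added horizontally. Requires AVX512BW. */
long reduce_add_epu8(__m512i a)
{
    return _mm512_reduce_add_epi64(_mm512_sad_epu8(a, _mm512_setzero_si512()));
}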

Try it on godbolt: https://godbolt.org/z/1rMiPH. Unfortunately, neither GCC nor Clang seems to be able to constant-fold the function when it is called with _mm512_set1_epi16(1).

For epi8 instead of epu8, first add 128 to each element (or, equivalently, xor each byte with 0x80) so that every value becomes non-negative, then reduce with vpsadbw, and finally subtract 64*128 from the total (or 8*128 from each intermediate 64-bit result). [Note: this was wrong in a previous version of this answer.]
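The signed recipe above can be sketched as follows. This is only an illustration under stated assumptions: the names sum64_i8_avx512 and sum64_i8 are hypothetical, the input is taken as a pointer to 64 bytes rather than an __m512i so the wrapper can be called from code compiled without AVX-512 flags, and the scalar fallback exists only so the example runs on any machine (GCC or Clang on x86-64 is assumed for the vector path).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: sum of 64 signed bytes via vpsadbw. */
__attribute__((target("avx512f,avx512bw")))
static int64_t sum64_i8_avx512(const int8_t *p)
{
    __m512i a = _mm512_loadu_si512((const void *)p);

    /* Flipping the top bit maps each signed x in [-128, 127] to the
       unsigned value x + 128 in [0, 255]. */
    __m512i biased = _mm512_xor_si512(a, _mm512_set1_epi8((char)0x80));

    /* Sum each group of 8 unsigned bytes into one 64-bit lane. */
    __m512i sad = _mm512_sad_epu8(biased, _mm512_setzero_si512());

    /* Add the 8 lanes, then remove the bias that was added 64 times. */
    return _mm512_reduce_add_epi64(sad) - 64 * 128;
}
#endif

/* Hypothetical wrapper: uses the AVX-512 path only when the CPU supports it. */
int64_t sum64_i8(const int8_t *p)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512bw"))
        return sum64_i8_avx512(p);
#endif
    int64_t s = 0;
    for (size_t i = 0; i < 64; ++i)
        s += p[i];
    return s;
}
```

The bias is needed because vpsadbw treats its inputs as unsigned: without it, a byte holding -1 would be counted as 255.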

For epi16, I suggest looking at the instructions that _mm512_reduce_add_epi32 and _mm512_reduce_add_epi64 generate and deriving what to do from there.
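One possible shortcut for epi16, which is a different technique from the derivation suggested above and not taken from the answer, is to multiply by a vector of ones with _mm512_madd_epi16 (vpmaddwd). That adds adjacent pairs of signed 16-bit values into 32-bit lanes, which _mm512_reduce_add_epi32 can then finish. The names sum32_i16_avx512 and sum32_i16 are hypothetical, and the scalar fallback is only there so the sketch runs without AVX-512 (GCC or Clang on x86-64 assumed for the vector path).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: sum of 32 signed 16-bit integers. */
__attribute__((target("avx512f,avx512bw")))
static int32_t sum32_i16_avx512(const int16_t *p)
{
    __m512i a = _mm512_loadu_si512((const void *)p);

    /* Each 32-bit lane becomes a[2k] * 1 + a[2k+1] * 1, computed in 32 bits,
       so the pairwise sums cannot overflow. */
    __m512i pairs = _mm512_madd_epi16(a, _mm512_set1_epi16(1));

    /* Horizontal sum of the 16 resulting 32-bit lanes. */
    return _mm512_reduce_add_epi32(pairs);
}
#endif

/* Hypothetical wrapper: uses the AVX-512 path only when the CPU supports it. */
int32_t sum32_i16(const int16_t *p)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512bw"))
        return sum32_i16_avx512(p);
#endif
    int32_t s = 0;
    for (size_t i = 0; i < 32; ++i)
        s += p[i];
    return s;
}
```

Called on 32 ones, as in the question's MAGIC snippet, this returns 32. Note that the result is a 32-bit value; narrowing it to short, as the question does, can truncate larger sums.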


Overall, as @Mysticial suggested, the best way to reduce depends on your context. For example, if you have a very large array of int64 and want the sum as an int64, you should add the values together packet-wise (one 512-bit vector at a time) and only at the very end reduce the single accumulated packet to one int64.
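The packet-wise approach for an int64 array might look like the sketch below: a vector accumulator collects eight running sums, and the horizontal reduction happens exactly once after the loop. The names sum_i64_avx512 and sum_i64 are hypothetical, the tail handling and the scalar fallback are additions for completeness, and overflow of the int64 total is not handled (GCC or Clang on x86-64 assumed for the vector path).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Hypothetical helper: sum n int64 values with one final horizontal reduction. */
__attribute__((target("avx512f")))
static int64_t sum_i64_avx512(const int64_t *data, size_t n)
{
    __m512i acc = _mm512_setzero_si512();
    size_t i = 0;

    /* Vertical adds only: lane k accumulates data[k], data[k + 8], ... */
    for (; i + 8 <= n; i += 8)
        acc = _mm512_add_epi64(acc, _mm512_loadu_si512((const void *)(data + i)));

    /* Single horizontal reduction at the very end. */
    int64_t total = _mm512_reduce_add_epi64(acc);

    /* Remaining 0..7 elements that do not fill a whole vector. */
    for (; i < n; ++i)
        total += data[i];
    return total;
}
#endif

/* Hypothetical wrapper: uses the AVX-512 path only when the CPU supports it. */
int64_t sum_i64(const int64_t *data, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f"))
        return sum_i64_avx512(data, n);
#endif
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```

The same pattern applies to bytes: accumulate the _mm512_sad_epu8 results with _mm512_add_epi64 inside the loop and reduce once afterwards.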
