AVX2 byte gather with uint16 indices, into a __m256i


Question


I am trying to pack a __m256i variable with 32 chars from an array, selected by indices. Here is my code:

char array[];         // different array every time.
uint16_t offset[32];  // same offset reused many times


_mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]], array[offset[7]],
      array[offset[8]],array[offset[9]],array[offset[10]],array[offset[11]], array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]], 
      array[offset[16]],array[offset[17]], array[offset[18]], array[offset[19]], array[offset[20]], array[offset[21]], array[offset[22]], array[offset[23]], 
      array[offset[24]],array[offset[25]],array[offset[26]], array[offset[27]], array[offset[28]], array[offset[29]], array[offset[30]],array[offset[31]]);

This function will be called many times with the same offsets and different arrays, but according to my tests it doesn't seem optimal. Is there any way to improve it?

Solution

Let's look first at solutions that work for a general offset that varies with every call (which will be a drop-in replacement for the existing function), and then we'll see if we can take advantage of the same offset array being used for several calls (while array always varies).

Varying Offset

Well one big problem is that gcc (old or new) just generates awful code for the current implementation of your function:

  lea r10, [rsp+8]
  and rsp, -32
  push QWORD PTR [r10-8]
  push rbp
  mov rbp, rsp
  push r15
  push r14
  push r13
  push r12
  push r10
  push rbx
  sub rsp, 40
  movzx eax, WORD PTR [rsi+40]
  movzx r14d, WORD PTR [rsi+60]
  movzx r12d, WORD PTR [rsi+56]
  movzx ecx, WORD PTR [rsi+44]
  movzx r15d, WORD PTR [rsi+62]
  movzx r13d, WORD PTR [rsi+58]
  mov QWORD PTR [rbp-56], rax
  movzx eax, WORD PTR [rsi+38]
  movzx ebx, WORD PTR [rsi+54]
  movzx r11d, WORD PTR [rsi+52]
  movzx r10d, WORD PTR [rsi+50]
  movzx r9d, WORD PTR [rsi+48]
  movzx r8d, WORD PTR [rsi+46]
  mov QWORD PTR [rbp-64], rax
  movzx eax, WORD PTR [rsi+36]
  movzx edx, WORD PTR [rsi+42]
  mov QWORD PTR [rbp-72], rax
  movzx eax, WORD PTR [rsi+34]
  mov QWORD PTR [rbp-80], rax
  movzx eax, WORD PTR [rsi+32]
  mov QWORD PTR [rbp-88], rax
  movzx eax, WORD PTR [rsi+30]
  movzx r15d, BYTE PTR [rdi+r15]
  mov QWORD PTR [rbp-96], rax
  movzx eax, WORD PTR [rsi+28]
  vmovd xmm2, r15d
  vpinsrb xmm2, xmm2, BYTE PTR [rdi+r14], 1
  mov QWORD PTR [rbp-104], rax
  movzx eax, WORD PTR [rsi+26]
  mov QWORD PTR [rbp-112], rax
  movzx eax, WORD PTR [rsi+24]
  mov QWORD PTR [rbp-120], rax
  movzx eax, WORD PTR [rsi+22]
  mov QWORD PTR [rbp-128], rax
  movzx eax, WORD PTR [rsi+20]
  mov QWORD PTR [rbp-136], rax
  movzx eax, WORD PTR [rsi+18]
  mov QWORD PTR [rbp-144], rax
  movzx eax, WORD PTR [rsi+16]
  mov QWORD PTR [rbp-152], rax
  movzx eax, WORD PTR [rsi+14]
  mov QWORD PTR [rbp-160], rax
  movzx eax, WORD PTR [rsi+12]
  mov QWORD PTR [rbp-168], rax
  movzx eax, WORD PTR [rsi+10]
  mov QWORD PTR [rbp-176], rax
  movzx eax, WORD PTR [rsi+8]
  mov QWORD PTR [rbp-184], rax
  movzx eax, WORD PTR [rsi+6]
  mov QWORD PTR [rbp-192], rax
  movzx eax, WORD PTR [rsi+4]
  mov QWORD PTR [rbp-200], rax
  movzx eax, WORD PTR [rsi+2]
  movzx esi, WORD PTR [rsi]
  movzx r13d, BYTE PTR [rdi+r13]
  movzx r8d, BYTE PTR [rdi+r8]
  movzx edx, BYTE PTR [rdi+rdx]
  movzx ebx, BYTE PTR [rdi+rbx]
  movzx r10d, BYTE PTR [rdi+r10]
  vmovd xmm7, r13d
  vmovd xmm1, r8d
  vpinsrb xmm1, xmm1, BYTE PTR [rdi+rcx], 1
  mov rcx, QWORD PTR [rbp-56]
  vmovd xmm5, edx
  vmovd xmm3, ebx
  mov rbx, QWORD PTR [rbp-72]
  vmovd xmm6, r10d
  vpinsrb xmm7, xmm7, BYTE PTR [rdi+r12], 1
  vpinsrb xmm5, xmm5, BYTE PTR [rdi+rcx], 1
  mov rcx, QWORD PTR [rbp-64]
  vpinsrb xmm6, xmm6, BYTE PTR [rdi+r9], 1
  vpinsrb xmm3, xmm3, BYTE PTR [rdi+r11], 1
  vpunpcklwd xmm2, xmm2, xmm7
  movzx edx, BYTE PTR [rdi+rcx]
  mov rcx, QWORD PTR [rbp-80]
  vpunpcklwd xmm1, xmm1, xmm5
  vpunpcklwd xmm3, xmm3, xmm6
  vmovd xmm0, edx
  movzx edx, BYTE PTR [rdi+rcx]
  mov rcx, QWORD PTR [rbp-96]
  vpunpckldq xmm2, xmm2, xmm3
  vpinsrb xmm0, xmm0, BYTE PTR [rdi+rbx], 1
  mov rbx, QWORD PTR [rbp-88]
  vmovd xmm4, edx
  movzx edx, BYTE PTR [rdi+rcx]
  mov rcx, QWORD PTR [rbp-112]
  vpinsrb xmm4, xmm4, BYTE PTR [rdi+rbx], 1
  mov rbx, QWORD PTR [rbp-104]
  vpunpcklwd xmm0, xmm0, xmm4
  vpunpckldq xmm0, xmm1, xmm0
  vmovd xmm1, edx
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm1, xmm1, BYTE PTR [rdi+rbx], 1
  mov rcx, QWORD PTR [rbp-128]
  mov rbx, QWORD PTR [rbp-120]
  vpunpcklqdq xmm0, xmm2, xmm0
  vmovd xmm8, edx
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm8, xmm8, BYTE PTR [rdi+rbx], 1
  mov rcx, QWORD PTR [rbp-144]
  mov rbx, QWORD PTR [rbp-136]
  vmovd xmm4, edx
  vpunpcklwd xmm1, xmm1, xmm8
  vpinsrb xmm4, xmm4, BYTE PTR [rdi+rbx], 1
  movzx edx, BYTE PTR [rdi+rcx]
  mov rbx, QWORD PTR [rbp-152]
  mov rcx, QWORD PTR [rbp-160]
  vmovd xmm7, edx
  movzx eax, BYTE PTR [rdi+rax]
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm7, xmm7, BYTE PTR [rdi+rbx], 1
  mov rcx, QWORD PTR [rbp-176]
  mov rbx, QWORD PTR [rbp-168]
  vmovd xmm5, eax
  vmovd xmm2, edx
  vpinsrb xmm5, xmm5, BYTE PTR [rdi+rsi], 1
  vpunpcklwd xmm4, xmm4, xmm7
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm2, xmm2, BYTE PTR [rdi+rbx], 1
  vpunpckldq xmm1, xmm1, xmm4
  mov rbx, QWORD PTR [rbp-184]
  mov rcx, QWORD PTR [rbp-192]
  vmovd xmm6, edx
  movzx edx, BYTE PTR [rdi+rcx]
  vpinsrb xmm6, xmm6, BYTE PTR [rdi+rbx], 1
  mov rbx, QWORD PTR [rbp-200]
  vmovd xmm3, edx
  vpunpcklwd xmm2, xmm2, xmm6
  vpinsrb xmm3, xmm3, BYTE PTR [rdi+rbx], 1
  add rsp, 40
  vpunpcklwd xmm3, xmm3, xmm5
  vpunpckldq xmm2, xmm2, xmm3
  pop rbx
  pop r10
  vpunpcklqdq xmm1, xmm1, xmm2
  pop r12
  pop r13
  vinserti128 ymm0, ymm0, xmm1, 0x1
  pop r14
  pop r15
  pop rbp
  lea rsp, [r10-8]
  ret

Basically it's trying to do all the reads of offset up front and just runs out of registers, so it starts spilling a few and then goes on an orgy of spilling where it's just reading most of the 16-bit elements of offset and immediately storing them (as 64-bit zero-extended values) onto the stack. Essentially it's copying most of the offset array (with zero extension to 64 bits) for no purpose: where it later reads the spilled values it could of course have just read from offset directly.

This same terrible code is evident in the old 4.9.2 version you're using as well as the very recent 7.2.


Neither icc nor clang has any such problem - they both generate almost identical, quite reasonable code that reads each offset position once using movzx and then inserts the byte using vpinsrb with a memory-source operand based on the offset just read:

gather256(char*, unsigned short*): # @gather256(char*, unsigned short*)
  movzx eax, word ptr [rsi + 30]
  movzx eax, byte ptr [rdi + rax]
  vmovd xmm0, eax
  movzx eax, word ptr [rsi + 28]
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 1
  movzx eax, word ptr [rsi + 26]
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 2
  movzx eax, word ptr [rsi + 24]
  ...
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 14
  movzx eax, word ptr [rsi]
  vpinsrb xmm0, xmm0, byte ptr [rdi + rax], 15
  movzx eax, word ptr [rsi + 62]
  movzx eax, byte ptr [rdi + rax]
  vmovd xmm1, eax
  movzx eax, word ptr [rsi + 60]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 1
  movzx eax, word ptr [rsi + 58]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 2
  movzx eax, word ptr [rsi + 56]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 3
  movzx eax, word ptr [rsi + 54]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 4
  movzx eax, word ptr [rsi + 52]
  ...
  movzx eax, word ptr [rsi + 32]
  vpinsrb xmm1, xmm1, byte ptr [rdi + rax], 15
  vinserti128 ymm0, ymm1, xmm0, 1
  ret

Very nice. There is a small amount of additional overhead from the vinserti128 that joins the two xmm vectors, each holding half of the result, apparently because vpinsrb can't write to the high 128 bits. It seems that on modern uarchs like the one you are using, this would simultaneously bottleneck on the 2 read ports and on port 5 (shuffle) at 1 element per cycle. So it will probably have a throughput of about 1 per 32 cycles, and a latency close to 32 cycles (the main dependence chain is through the work-in-progress xmm register that is receiving the vpinsrb, but the listed latency for the memory-source version of this instruction is only 1 cycle^1).

Can we get close to this 32-cycle performance on gcc? It seems so. Here's one approach:

uint64_t gather64(char *array, uint16_t *offset) {
  uint64_t ret;
  char *p = (char *)&ret;
  p[0] = array[offset[0]];
  p[1] = array[offset[1]];
  p[2] = array[offset[2]];
  p[3] = array[offset[3]];
  p[4] = array[offset[4]];
  p[5] = array[offset[5]];
  p[6] = array[offset[6]];
  p[7] = array[offset[7]];
  return ret;
}

__m256i gather256_gcc(char *array, uint16_t *offset) {

  return _mm256_set_epi64x(
    gather64(array, offset),
    gather64(array +  8, offset + 8),
    gather64(array + 16, offset + 16),
    gather64(array + 24, offset + 24)
  );
}

Here we rely on a temporary array on the stack to gather 8 elements from array at a time, and then we use that as input into _mm256_set_epi64x. Overall this uses 2 loads and 1 store per byte element, plus a couple of extra instructions for every 64-bit element, so it should be close to 1 cycle per element throughput^2.

It generates the "expected" inlined code in gcc:

gather256_gcc(char*, unsigned short*):
  lea r10, [rsp+8]
  and rsp, -32
  push QWORD PTR [r10-8]
  push rbp
  mov rbp, rsp
  push r10
  movzx eax, WORD PTR [rsi+48]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-24], al
  movzx eax, WORD PTR [rsi+50]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-23], al
  movzx eax, WORD PTR [rsi+52]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-22], al
  ...
  movzx eax, WORD PTR [rsi+62]
  movzx eax, BYTE PTR [rdi+24+rax]
  mov BYTE PTR [rbp-17], al
  movzx eax, WORD PTR [rsi+32]
  vmovq xmm0, QWORD PTR [rbp-24]
  movzx eax, BYTE PTR [rdi+16+rax]
  movzx edx, WORD PTR [rsi+16]
  mov BYTE PTR [rbp-24], al
  movzx eax, WORD PTR [rsi+34]
  movzx edx, BYTE PTR [rdi+8+rdx]
  movzx eax, BYTE PTR [rdi+16+rax]
  mov BYTE PTR [rbp-23], al
  ...
  movzx eax, WORD PTR [rsi+46]
  movzx eax, BYTE PTR [rdi+16+rax]
  mov BYTE PTR [rbp-17], al
  mov rax, QWORD PTR [rbp-24]
  mov BYTE PTR [rbp-24], dl
  movzx edx, WORD PTR [rsi+18]
  vpinsrq xmm0, xmm0, rax, 1
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-23], dl
  movzx edx, WORD PTR [rsi+20]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-22], dl
  movzx edx, WORD PTR [rsi+22]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-21], dl
  movzx edx, WORD PTR [rsi+24]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-20], dl
  movzx edx, WORD PTR [rsi+26]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-19], dl
  movzx edx, WORD PTR [rsi+28]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-18], dl
  movzx edx, WORD PTR [rsi+30]
  movzx edx, BYTE PTR [rdi+8+rdx]
  mov BYTE PTR [rbp-17], dl
  movzx edx, WORD PTR [rsi]
  vmovq xmm1, QWORD PTR [rbp-24]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-24], dl
  movzx edx, WORD PTR [rsi+2]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-23], dl
  movzx edx, WORD PTR [rsi+4]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-22], dl
  ...
  movzx edx, WORD PTR [rsi+12]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-18], dl
  movzx edx, WORD PTR [rsi+14]
  movzx edx, BYTE PTR [rdi+rdx]
  mov BYTE PTR [rbp-17], dl
  vpinsrq xmm1, xmm1, QWORD PTR [rbp-24], 1
  vinserti128 ymm0, ymm0, xmm1, 0x1
  pop r10
  pop rbp
  lea rsp, [r10-8]
  ret

This approach will suffer 4 (independent) store-forwarding stalls when trying to read back the stack buffer, which will make the latency somewhat worse than 32 cycles, perhaps into the mid-40s (if you assume it's the last stall that isn't hidden). You could also just remove the gather64 function and unroll the whole thing into a 32-byte buffer, with a single load at the end. That results in only one stall and gets rid of the small overhead of loading each 64-bit value into the result one at a time, but the overall effect might be worse, since larger loads sometimes seem to suffer larger forwarding stalls.
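For illustration, here's a minimal sketch of that single-buffer variant. The function name is mine, and the index-to-byte mapping is the straightforward offset[i] -> byte i; adjust it to reproduce whatever ordering your original _mm256_set_epi8 call used.

#include <immintrin.h>
#include <stdint.h>

// Sketch: fill one 32-byte buffer, then do a single 32-byte load.
// A real version would fully unroll the loop, as in gather64 above.
__m256i gather256_onebuf(const char *array, const uint16_t *offset) {
  char buf[32];
  for (int i = 0; i < 32; i++)
    buf[i] = array[offset[i]];                       // 2 loads + 1 store per byte
  return _mm256_loadu_si256((const __m256i *)buf);   // one (wider) forwarding stall
}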

I'm quite sure you can come up with approaches that are better. For example, you could just write out the vpinsrb approach that clang and icc use "long hand" in intrinsics. That's simple enough that gcc should get it right.
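As a rough sketch of what that might look like (the names are mine, and the byte order is again the plain offset[i] -> byte i mapping, so flip the indexing if you need the exact _mm256_set_epi8 ordering from the question):

#include <immintrin.h>
#include <stdint.h>

// Build one xmm half with a movzx load + vpinsrb chain, as clang/icc do.
static inline __m128i gather16(const char *array, const uint16_t *offset) {
  __m128i v = _mm_cvtsi32_si128((unsigned char)array[offset[0]]);
#define INS(i) v = _mm_insert_epi8(v, array[offset[i]], (i))
  INS(1);  INS(2);  INS(3);  INS(4);  INS(5);  INS(6);  INS(7);
  INS(8);  INS(9);  INS(10); INS(11); INS(12); INS(13); INS(14); INS(15);
#undef INS
  return v;
}

__m256i gather256_pinsrb(const char *array, const uint16_t *offset) {
  __m128i lo = gather16(array, offset);        // result bytes 0..15
  __m128i hi = gather16(array, offset + 16);   // result bytes 16..31
  return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}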


Repeated Offset

What if the offset array is used repeatedly for several different array inputs?

We can look at pre-processing the offset array so that our core load loop can be faster.

One viable approach is to use vpgatherdd to load elements efficiently without bottlenecking on port 5 for the shuffles. We can also load the entire gather-index vector in a single 256-bit load. Unfortunately, the finest-grained vpgather is vpgatherdd, which loads 8 32-bit elements using 32-bit offsets. So we'll need 4 of these gathers to get all 32 byte elements, and then we'll need to blend the resulting vectors somehow.

We can actually avoid most of the cost of combining the resulting arrays by interleaving and adjusting the offsets so that the "target" byte in each 32-bit value is actually its correct final position. So you end up with 4 256-bit vectors, each with 8 bytes that you want, in the correct position, and 24 bytes you don't want. You can vpblendw two pairs of vectors together, and then vpblendb those results together, for a total of 3 port 5 uops (there's got to be a better way to do this reduction?).
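Here is a rough sketch of that gather+blend combination, assuming the pre-processing step has already built the biased index vectors. The struct name, the particular interleave, and the bias values below are illustrative choices of mine, not from any specific implementation; the "vpblendb" above is the variable byte blend vpblendvb, i.e. _mm256_blendv_epi8.

#include <immintrin.h>
#include <stdint.h>

// One possible interleave:
//   gather 0 -> result bytes 0, 4,  8, ..., 28  (byte 0 of its dword, index = offset)
//   gather 1 -> result bytes 2, 6, 10, ..., 30  (byte 2 of its dword, index = offset - 2)
//   gather 2 -> result bytes 1, 5,  9, ..., 29  (byte 1 of its dword, index = offset - 1)
//   gather 3 -> result bytes 3, 7, 11, ..., 31  (byte 3 of its dword, index = offset - 3)
// Biasing an index below zero would read before array, so offsets near 0
// need separate handling in a real implementation.
struct gather_plan {
  __m256i idx[4];     // 4 x 8 dword indices, pre-biased as above (built once)
  __m256i byte_mask;  // 0x00,0xFF,0x00,0xFF,... for the final vpblendvb
};

static inline __m256i gather256_blend(const char *array,
                                      const struct gather_plan *plan) {
  const int *base = (const int *)array;  // vpgatherdd loads dwords; scale 1 = byte indices
  __m256i g0 = _mm256_i32gather_epi32(base, plan->idx[0], 1);
  __m256i g1 = _mm256_i32gather_epi32(base, plan->idx[1], 1);
  __m256i g2 = _mm256_i32gather_epi32(base, plan->idx[2], 1);
  __m256i g3 = _mm256_i32gather_epi32(base, plan->idx[3], 1);

  // 2x vpblendw: g0 owns the even words, g1 the odd words (likewise g2/g3).
  __m256i even_bytes = _mm256_blend_epi16(g0, g1, 0xAA);
  __m256i odd_bytes  = _mm256_blend_epi16(g2, g3, 0xAA);

  // 1x vpblendvb: interleave the even- and odd-position bytes.
  return _mm256_blendv_epi8(even_bytes, odd_bytes, plan->byte_mask);
}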

Adding it all together, I get something like:

  • 4 movups to load the 4 vpgatherdd index regs (can be hoisted)
  • 4 vpgatherdd
  • 2 vpblendw (4 results -> 2)
  • 1 movups to load the vpblendb mask (can be hoisted)
  • 1 vpblendb (2 results -> 1)

Apart from the vpgatherdds it looks like about 9 uops, with 3 of them going to port 5, so 3 cycles bottlenecked on that port, or about 2.25 cycles if there is no bottleneck (because the vpgatherdd might not use port 5). On Broadwell, the vpgather family is much improved over Haswell, but it still takes about 0.9 cycles per element for vpgatherdd, so that's about 29 cycles right there. So we are right back to where we started, around 32 cycles.

Still, there is some hope:

  • The 0.9 cycles per element is for mostly pure vpgatherdd activity. Perhaps then the blending code is more or less free, and we are around 29 cycles (realistically, the movups will still be competing with the gather, however).
  • vpgatherdd got a lot better again in Skylake, to about 0.6 cycles per element, so this strategy will start to help significantly when you upgrade your hardware to Skylake. (And the strategy may pull slightly farther ahead of vpinsrb with AVX512BW, where byte blends with a k-register mask are efficient, and vpgatherdd zmm per-element gather throughput is slightly higher than ymm (InstLatx64).)
  • Pre-processing gives you the chance to check whether duplicate elements are being read from array. In that case, you could potentially reduce the number of gathers. For example, if only half of the elements in offset are unique, you can do only two gathers to collect 16 elements and then pshufb a register to duplicate elements as needed (a minimal sketch of that duplication step follows this list). The "reduction" has to be more general, but it doesn't actually seem more expensive (and could be cheaper), since pshufb is quite general and does most of the work.
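A minimal sketch of that duplication step, assuming the pre-processing found at most 16 distinct bytes, gathered them into an xmm register, and built a 32-byte vpshufb control with values 0..15 mapping each output byte to its source (all names here are mine):

#include <immintrin.h>

// Expand <=16 unique gathered bytes into all 32 result positions.
// vpshufb only shuffles within 128-bit lanes, so first replicate the 16
// unique bytes into both lanes; then any 16 -> 32 mapping is one shuffle.
static inline __m256i expand16to32(__m128i uniq, __m256i dup_ctrl) {
  __m256i both = _mm256_inserti128_si256(_mm256_castsi128_si256(uniq), uniq, 1);
  return _mm256_shuffle_epi8(both, dup_ctrl);  // vpshufb
}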

Expanding on that last idea: you would dispatch at runtime to a routine that knows how to do 1, 2, 3 or 4 gathers depending on how many elements are needed. That is fairly quantized, but you could always dispatch in a more fine-grained way with scalar loads (or gathers with larger elements, which are faster) between those cutoff points. You'll hit diminishing returns pretty quickly.

You can even extend that to handling nearby elements - after all, you are grabbing 4 bytes to get a byte: so if any of those 3 wasted bytes is actually at another used offset value, then you get it nearly for free. Now, this needs an even more general reduction phase but it still seems like pshufb will do the heavy lifting and most of the hard work is limited to the pre-processing.


^1 This is one of a handful of SSE/AVX instructions where the memory-source form of the instruction is quite a bit more efficient than the reg-reg form: the reg-reg form needs 2 uops on port 5, which limits it to a throughput of 0.5 per cycle and gives it a latency of 2. Evidently the memory-load path avoids one of the shuffles/blends needed on port 5. vpbroadcastd/q are like that too.

^2 With two loads and one store per cycle, this will be running close to the ragged edge of the maximum theoretical performance: it maxes out the L1 operation throughput, which often results in hiccups; for example, there may not be any spare cycles to accept incoming cache lines from L2.
