Faster way to test if xmm/ymm register is zero?


Problem description


Fortunately, PTEST does not only set the (rather awkward) ZF; it also affects both CF and ZF.

I've come up with the following sequence to test a large number of values, but I'm unhappy with the poor running time.

              Latency / rThroughput
setup:
  xor eax,eax            ; na
  vpxor xmm0,xmm0,xmm0   ; na       ;mask to use for the nand operation of ptest
work:
  vptest xmm4,xmm0  ; 3   1    ;is xmm4 alive?
  adc eax,eax       ; 1   1    ;move first bit into eax
  vptest xmm5,xmm0  ; 3   1    ;is N alive?
  adc eax,eax       ; 1   1    ;move consecutive bits into eax 

I want to have a bitmap of all the non-zero registers in eax (obviously I can combine multiple bitmaps in multiple registers).

So every test has a latency of 3+1 = 4 cycles.
Some of this can run in parallel by alternating between eax,ecx etc.
But it's still quite slow.
Is there a faster way of doing this?

I need to test 8 xmm/ymm registers in a row, 1 bit per register in a one-byte bitmap.
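
For reference (not part of the original question), a minimal C-intrinsics sketch of this baseline idea might look like the following. It uses _mm_testz_si128(v, v), i.e. the ZF result of ptest v,v, instead of the CF/adc trick above, and requires SSE4.1; the function name is illustrative.

#include <immintrin.h>
#include <stdint.h>

/* Illustrative helper: bit i of the result is set when v[i] is non-zero. */
static inline uint8_t nonzero_bitmap_ptest(const __m128i v[8])
{
    uint8_t bits = 0;
    for (int i = 0; i < 8; i++) {
        /* _mm_testz_si128(a, b) returns 1 iff (a AND b) == 0 */
        if (!_mm_testz_si128(v[i], v[i]))
            bits |= (uint8_t)(1u << i);
    }
    return bits;
}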

Solution

Rather than being "quite slow", your existing approach is actually reasonable.

Sure, each individual test has a latency of 4 cycles [1], but if you want the result in a general-purpose register you are usually going to pay a 3-cycle latency for that move anyway (e.g., movmskb also has a latency of 3). In any case, you want to test 8 registers, and you don't simply add the latencies because each test is mostly independent, so uop count and port use will likely end up being more important than the latency of testing a single register, as most of the latencies will overlap with other work.

An approach that is likely to be a bit faster on Intel hardware is to use successive PCMPEQ instructions to test several vectors, and then fold the results together (e.g., if you use PCMPEQQ you effectively have 4 quadword results and need to AND-fold them into 1). You can fold either before or after the PCMPEQ, but it would help to know more about how/where you want the results in order to come up with something better. Here's an untested sketch for 8 registers, xmm1-8, with xmm0 assumed zero and xmm15 being the pblendvb mask to select alternate bytes, used in the last blend instruction.

# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm1, xmm0
vpcmpeqq xmm12, xmm3, xmm0
vpcmpeqq xmm13, xmm5, xmm0
vpcmpeqq xmm14, xmm7, xmm0

# blend the results down into xmm10   word origin
vpblendw xmm10, xmm11, xmm12, 0xAA   # 3131 3131
vpblendw xmm13, xmm13, xmm14, 0xAA   # 7575 7575
vpblendw xmm10, xmm10, xmm13, 0xCC   # 7531 7531

# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm2, xmm0
vpcmpeqq xmm12, xmm4, xmm0
vpcmpeqq xmm13, xmm6, xmm0
vpcmpeqq xmm14, xmm8, xmm0

# blend the results down into xmm11   word origin
vpblendw xmm11, xmm11, xmm12, 0xAA   # 4242 4242
vpblendw xmm13, xmm13, xmm14, 0xAA   # 8686 8686
vpblendw xmm11, xmm11, xmm13, 0xCC   # 8642 8642

# blend xmm10 and xmm11 together into xmm10, byte-wise
#         origin bytes
# xmm10 77553311 77553311
# xmm11 88664422 88664422
# res   87654321 87654321 
vpblendvb xmm10, xmm10, xmm11, xmm15

# move the mask bits into eax
vpmovmskb eax, xmm10
and al, ah

The intuition is that you test each QWORD in each xmm against zero, giving 16 results for the 8 registers, and then you blend the results together into xmm10, ending up with one result per byte, in order (with all the high-QWORD results before all the low-QWORD results). Then you move those 16 byte masks as 16 bits into eax with movmskb, and finally combine the high- and low-QWORD bits for each register inside eax.
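
Under the same assumptions (SSE4.1; PCMPEQQ against a zero vector), a rough C-intrinsics rendering of this reduction might look as follows. It is a sketch of the idea rather than the exact code above; the returned bit is set when a register is entirely zero (complement it for the non-zero bitmap the question asks for), and the function name is illustrative.

#include <immintrin.h>
#include <stdint.h>

static inline uint8_t zero_bitmap_8x128(const __m128i v[8])
{
    const __m128i zero = _mm_setzero_si128();
    /* pblendvb mask: select the odd (higher) byte of each word from 'even' */
    const __m128i altbytes = _mm_set1_epi16((short)0xFF00);

    /* per-qword zero tests: a qword lane becomes all-ones iff that qword is 0 */
    __m128i c1 = _mm_cmpeq_epi64(v[0], zero), c3 = _mm_cmpeq_epi64(v[2], zero);
    __m128i c5 = _mm_cmpeq_epi64(v[4], zero), c7 = _mm_cmpeq_epi64(v[6], zero);
    __m128i c2 = _mm_cmpeq_epi64(v[1], zero), c4 = _mm_cmpeq_epi64(v[3], zero);
    __m128i c6 = _mm_cmpeq_epi64(v[5], zero), c8 = _mm_cmpeq_epi64(v[7], zero);

    /* word-granularity blends: 7531 7531 and 8642 8642 */
    __m128i odd  = _mm_blend_epi16(_mm_blend_epi16(c1, c3, 0xAA),
                                   _mm_blend_epi16(c5, c7, 0xAA), 0xCC);
    __m128i even = _mm_blend_epi16(_mm_blend_epi16(c2, c4, 0xAA),
                                   _mm_blend_epi16(c6, c8, 0xAA), 0xCC);

    /* byte-granularity blend: 87654321 87654321 */
    __m128i all = _mm_blendv_epi8(odd, even, altbytes);

    unsigned m = (unsigned)_mm_movemask_epi8(all); /* 16 bits: high/low qwords */
    return (uint8_t)(m & (m >> 8));                /* the 'and al, ah' step    */
}

Compiled with SSE4.1 enabled (e.g. gcc -O2 -msse4.1), these intrinsics map essentially onto the pcmpeqq/pblendw/pblendvb/pmovmskb sequence above.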

That looks to me like 16 uops total for 8 registers, so about 2 uops per register. The total latency is reasonable since it is largely a "reduce"-type parallel tree. A limiting factor would be the 6 vpblendw operations, which all go only to port 5 on modern Intel. It would be better to replace 4 of those with VPBLENDD, which is the one "blessed" blend that goes to any of p015. That should be straightforward.

All the ops are simple and fast. The final and al, ah is a partial-register write, but if you mov it into eax afterwards perhaps there is no penalty. You could also do that last line a couple of different ways if that's an issue...

This approach also scales naturally to ymm registers, with slightly different folding in eax at the end.
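
As a rough sketch of that ymm extension (assumptions: AVX2 is available, and _mm256_blend_epi16/vpblendw blend within each 128-bit lane, so the same network repeats per lane and the 32-bit movemask just needs one extra fold; the function name is illustrative):

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: bit i of the result is set when v[i] is entirely zero. */
static inline uint8_t zero_bitmap_8x256(const __m256i v[8])
{
    const __m256i zero = _mm256_setzero_si256();
    const __m256i altbytes = _mm256_set1_epi16((short)0xFF00);

    __m256i c1 = _mm256_cmpeq_epi64(v[0], zero), c3 = _mm256_cmpeq_epi64(v[2], zero);
    __m256i c5 = _mm256_cmpeq_epi64(v[4], zero), c7 = _mm256_cmpeq_epi64(v[6], zero);
    __m256i c2 = _mm256_cmpeq_epi64(v[1], zero), c4 = _mm256_cmpeq_epi64(v[3], zero);
    __m256i c6 = _mm256_cmpeq_epi64(v[5], zero), c8 = _mm256_cmpeq_epi64(v[7], zero);

    /* word blends operate within each 128-bit lane, so the 7531/8642 pattern
       simply repeats per lane */
    __m256i odd  = _mm256_blend_epi16(_mm256_blend_epi16(c1, c3, 0xAA),
                                      _mm256_blend_epi16(c5, c7, 0xAA), 0xCC);
    __m256i even = _mm256_blend_epi16(_mm256_blend_epi16(c2, c4, 0xAA),
                                      _mm256_blend_epi16(c6, c8, 0xAA), 0xCC);
    __m256i all  = _mm256_blendv_epi8(odd, even, altbytes);

    /* 32 mask bits: one per qword result (4 qwords per ymm register) */
    uint32_t m = (uint32_t)_mm256_movemask_epi8(all);
    m &= m >> 16;          /* fold the two 128-bit lanes together   */
    m &= m >> 8;           /* fold high and low qwords of each lane */
    return (uint8_t)m;
}

The two extra shift-and-AND folds at the end are the "slightly different folding in eax" mentioned above.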

EDIT

A slightly faster ending uses packed shifts to avoid two expensive instructions:

;combine bytes of xmm10 and xmm11 together into xmm10, byte wise
; xmm10 77553311 77553311
; xmm11 88664422 88664422   before shift
; xmm10 07050301 07050301
; xmm11 80604020 80604020   after shift
;result 87654321 87654321   combined
vpsrlw xmm10,xmm10,8
vpsllw xmm11,xmm11,8
vpor xmm10,xmm10,xmm11

;combine the low and high qword to make sure both are zero. 
vpsrldq xmm12,xmm10,8    ;psrldq shifts by bytes: 8 bytes = 64 bits
vpand xmm10,xmm10,xmm12
vpmovmskb eax,xmm10

This saves 2 cycles by avoiding the 2-cycle vpblendvb and the partial-write penalty of and al,ah; it also fixes the dependency on the slow vpmovmskb, if you don't need to use the result of that instruction right away.
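
In the same hypothetical intrinsics rendering, the shift-based ending looks roughly like this, taking the two word-blended groups from the earlier sketch as inputs ('odd' holding the results for registers 1,3,5,7 and 'even' for 2,4,6,8):

#include <immintrin.h>
#include <stdint.h>

/* Sketch only: bit i of the result is set when register i+1 is entirely zero. */
static inline uint8_t zero_bitmap_tail_shift(__m128i odd, __m128i even)
{
    __m128i lo  = _mm_srli_epi16(odd, 8);       /* 07050301 07050301 */
    __m128i hi  = _mm_slli_epi16(even, 8);      /* 80604020 80604020 */
    __m128i all = _mm_or_si128(lo, hi);         /* 87654321 87654321 */

    /* bring the high-qword results down and AND them with the low-qword
       results; psrldq shifts by bytes, so 8 bytes == 64 bits */
    __m128i both = _mm_and_si128(all, _mm_srli_si128(all, 8));
    return (uint8_t)_mm_movemask_epi8(both);    /* low 8 bits are the bitmap */
}

This corresponds to the vpsrlw/vpsllw/vpor/vpsrldq/vpand/vpmovmskb sequence above.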


[1] Actually, it seems to be only on Skylake that PTEST has a latency of three cycles; before that it seems to be 2. I'm also not sure about the 1-cycle latency you listed for rcl eax, 1: according to Agner, it seems to be 3 uops and 2 cycles latency/reciprocal throughput on modern Intel.
