ARM NEON: comparing 128-bit values
Problem description
I'm interested in finding the fastest way (lowest cycle count) to compare, for equality, the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).
So far I have the following:
(1) Using the VFP floating point comparison:
vcmp.f64 d0, d6
vmrs APSR_nzcv, fpscr
vcmpeq.f64 d1, d7
vmrseq APSR_nzcv, fpscr
If either 64-bit half holds a bit pattern that decodes as NaN, this version will not work, because a NaN never compares equal, not even to itself.
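The NaN problem can be reproduced off-target with a small C model. This is only a sketch, assuming an IEEE-754 binary64 `double` with the same byte order as `uint64_t`; the helper names `as_double`, `fp_equal` and `bits_equal` are made up for illustration. It also shows a second mismatch between floating-point and bitwise equality: +0.0 and -0.0 compare equal although their bit patterns differ.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double, the way a D register is
 * interpreted by a floating-point compare. */
static double as_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical helper: equality as a floating-point compare sees it. */
static int fp_equal(uint64_t a, uint64_t b) {
    return as_double(a) == as_double(b);
}

/* Reference: the bitwise equality the question is actually after. */
static int bits_equal(uint64_t a, uint64_t b) {
    return a == b;
}
```

With these helpers, `fp_equal(x, x)` is false for any `x` that encodes a NaN (for example `0x7ff8000000000001`), while `fp_equal(0, 0x8000000000000000)` is true even though the two patterns are different.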
(2) Using NEON narrowing and the VFP comparison (this time only once, and in a NaN-safe manner):
vceq.i32 q15, q0, q3
vmovn.i32 d31, q15
vshl.s16 d31, d31, #8
vcmp.f64 d31, d29
vmrs APSR_nzcv, fpscr
The D29 register is preloaded beforehand with the matching 16-bit pattern:
vmov.i16 d29, #65280 ; 0xff00
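To see why this variant is NaN-safe, the sequence can be modelled in portable C. This is a sketch under assumptions rather than a cycle-accurate emulation: the function names are hypothetical, lane 0 is taken to be the least significant 16 bits of the D register image, and `double` is assumed to be IEEE-754 binary64. Every 16-bit lane ends up as either 0xff00 or 0x0000, so the 11-bit exponent field of the resulting 64-bit pattern is at most 0x7f0 and never the all-ones value 0x7ff that marks NaN or infinity.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

static double as_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Model of: vceq.i32 q15, q0, q3 / vmovn.i32 d31, q15 / vshl.s16 d31, d31, #8.
 * Returns the 64-bit image of D31. */
static uint64_t method2_mask(const uint32_t a[4], const uint32_t b[4]) {
    uint64_t d31 = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t eq32 = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u; /* vceq.i32 */
        uint16_t eq16 = (uint16_t)eq32;                    /* vmovn.i32 */
        uint16_t sh = (uint16_t)(eq16 << 8);               /* vshl #8: 0xff00 or 0 */
        d31 |= (uint64_t)sh << (16 * i);
    }
    return d31;
}

/* Model of: vcmp.f64 d31, d29 with d29 = 0xff00 in every 16-bit lane. */
static int method2_equal(const uint32_t a[4], const uint32_t b[4]) {
    const uint64_t d29 = 0xFF00FF00FF00FF00ULL;
    return as_double(method2_mask(a, b)) == as_double(d29);
}
```

Because neither operand of the final compare can be a NaN, and D29 is nonzero, the floating-point compare reports equality exactly when the two 64-bit patterns are identical, that is, when all four words matched.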
My question is: is there anything better than this? Am I overlooking some obvious way to do it?
Recommended answer
I believe you can reduce it by one instruction. By using shift left and insert (VSLI), you can combine the four 32-bit values of Q15 into four 16-bit values in D31. You can then compare with 0 and get the floating-point flags.
vceq.i32 q15, q0, q3
vsli.32 d31, d30, #16
vcmp.f64 d31, #0
vmrs APSR_nzcv, fpscr
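Before relying on the flags from this shorter sequence, it may help to trace the bit patterns in a C model. The sketch below assumes the usual definitions (VCEQ writes all ones to a lane whose elements are equal; the shift-left-and-insert step keeps the low 16 bits of each destination word and inserts the shifted source above them) and an IEEE-754 binary64 `double`. The function names are hypothetical. Under those assumptions, D31 is zero only when none of the four words match, while fully equal inputs give an all-ones D31, which decodes as a NaN, so the compare against 0 would report "unordered" rather than "equal" in that case. A result with only word 1 equal also decodes as a NaN, so the outcome is worth verifying on hardware.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

static double as_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Model of: vceq.i32 q15, q0, q3, then shift-left-and-insert of d30 into
 * d31 by 16 bits. Q15 is the pair d30 (words 0, 1) and d31 (words 2, 3).
 * Returns the 64-bit image of D31. */
static uint64_t vsli_mask(const uint32_t a[4], const uint32_t b[4]) {
    uint32_t eq[4];
    for (int i = 0; i < 4; i++)
        eq[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;       /* vceq.i32 */
    /* Each 32-bit lane: keep low 16 bits of d31, insert d30 << 16 above. */
    uint32_t lo = (eq[0] << 16) | (eq[2] & 0xFFFFu);
    uint32_t hi = (eq[1] << 16) | (eq[3] & 0xFFFFu);
    return ((uint64_t)hi << 32) | lo;
}

/* Model of the Z flag after vcmp.f64 d31, #0 (set when d31 compares equal to 0). */
static int z_flag_after_vcmp_zero(uint64_t d31) {
    return as_double(d31) == 0.0;
}
```

In this model, 128-bit equality corresponds to `vsli_mask(a, b) == 0xFFFFFFFFFFFFFFFF`, whereas `z_flag_after_vcmp_zero` is true only when every word differs.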