如何比较装配数组元素 [英] How to compare elements in array in assembly

查看:157
本文介绍了如何比较装配数组元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何在汇编语言x86的数组中的所有元素比较?我要比较数组中的所有元素,并打印最大的一个

How to compare all the elements in an array in Assembly language x86? I have to compare all the elements in an array and print the biggest one

推荐答案

我要pretend这不是一个可怕的琐碎问题,实际上是在讨论汇编语言这样做的有趣的部分(而不是让编译器优化它你)。

I'm going to pretend this wasn't a terrible trivial question, and actually discuss the interesting parts of doing this in assembly language (instead of letting a compiler optimize it for you).

在ASM,你可以像你会在任何其他语言做。但是,如果你与向量指令编程的机器,你可以而且应该使用它们。编译器通常会为你做这一点,但在ASM你必须自己做。

In asm, you could do it like you would in any other language. However, if you're programming for a machine with vector instructions, you can and should use them. Compilers will usually do this for you, but in asm you have to do it yourself.

由于在ASM书面code最主要的原因是高性能,让我们考虑一些问题:

Since the main reason for writing code in asm is high performance, let's consider some of the issues:

如果没有向量指令,它可能会或可能不会使用条件,动做平常 X =最大值是一个好主意(X,A [I]) CMOV 将引入一个循环携带依赖性,这可能会伤害更多的性能比偶尔的分支误predict。 (谷歌所有关于这一点)。

Without vector instructions, it might or might not be a good idea to use a conditional-move to do the usual x=max(x, a[i]). cmov would introduce a loop-carried dependency, which might hurt performance more than the occasional branch mispredict. (google for more about this).

分支误predicts发现最大的时候,除非你的价值观是嘈杂,但在平均增长可能并不频繁。 (例如新的最高被认为是每1〜10个元素是接近最坏情况下)。否则,你可能倾向于去很长一段时间要么始终看到一个新的最大值或从来没有看到一个新的最大值。

Branch mispredicts are probably not frequent when finding the max, unless your values are noisy but on average increasing. (e.g. a new max is seen every 1 to 10 elements would be near worst-case.) Otherwise, you probably tend to go for a long time either always seeing a new max or never seeing a new max.

86有工作就像每个元素的基础上CMP / CMOV矢量最小/最大指令。

x86 has vector min/max instructions that work like a cmp/cmov on a per-element basis.

所以,如果你的阵列由32位有符号整数,可以通过装载第4个元素到向量寄存器(比如 XMM0 )的使用开始,然后用添加RSI,16 / PMAXSD XMM0,[RSI] 循环里面做4包装 X = MAX(X,SRC)操作。 PMAXSD 英文是:盒装签名双字元素(整数)最大。请参见 86 维基参考指南。 PMAXSD 是SSE4.1的一部分,所以它仅支持使用该功能位CPU。

So if your array is made up of 32bit signed integers, you could use start by loading the first 4 elements into a vector register (say xmm0), then use add rsi, 16 / PMAXSD xmm0, [rsi] inside a loop to do 4 packed x=max(x,src) operations. PMAXSD in English is: Packed(integer) Max of Signed DWord elements. See the links in the x86 wiki for reference guides. PMAXSD is part of SSE4.1, so it's only supported on CPUs with that feature-bit.

如果您的阵列由的uint8_t有元素,你会使用 PMINUB (盒装(INT)最小无符号字节元素)。 PMIN / MAXUB PMIN / MAXSW 在SSE2,所以他们对x86-64的(以及x86的基线-32上需要与支持SSE2新的足够的硬件)操作系统。

If your array was made up of uint8_t elements, you'd use PMINUB (Packed(int) Min of Unsigned Byte elements). PMIN/MAXUB and PMIN/MAXSW are in SSE2, so they're baseline for x86-64 (and for x86-32 on OSes which require new enough hardware with SSE2 support).

循环阵列上(也许用PALIGNR或PSRLDQ来处理数组的最后一个非多的-16B位)之后,你必须在你的蓄电池矢量4个元素。每个人是每4个元素的最大值,为四个不同的偏移。要获得全部测试MAX,你需要找到水平在向量最大元素。通过改组它(如字节移是正确的,那么高的两个元素转移到低两个元素的位置),然后使用比较它 PMAXSD 反对这样做联合国洗牌值。然后重复上述过程,以获得最后两个元素的最大。

After looping over the array (maybe using PALIGNR or PSRLDQ to handle the last non-multiple-of-16B bit of the array), you'll have 4 elements in your accumulator vector. Each one is the max of every 4th element, for four different offsets. To get the overal max, you need to horizontally find the max element in the vector. Do this by shuffling it (e.g. byte-shifting it right, so the high two elements move to the position of the low two elements), then and comparing it using PMAXSD against the un-shuffled value. Then repeat the process, to get the max of the last two elements.

现在可以存储32位INT到内存中,或使用 MOVD 直接从 XMM0 将其传送到 EAX 作为函数的返回值。

Now you can store that 32bit int to memory, or use movd to transfer it directly from xmm0 to eax as a function return value.

还有一定的提升空间在这里,因为即使 pmaxsd 有一个周期(英特尔Haswell的为例)的延迟,它有一个吞吐量的2%的周期。因此,理想情况下,我们可以维持吞吐量每个时钟两PMAX的,内存操作数。 (由于英特尔SNB及以后有两个负载口,L1缓存可以用这个跟上。)我们需要使用多个蓄电池允许并行操作。 (然后PMAX所有的蓄电池一起底,做横向操作之前)。

There's some room for improvement here, since even though pmaxsd has a latency of one cycle (Intel Haswell for example), it has a throughput of 2 per cycle. So ideally we can sustain a throughput of two PMAX per clock, with memory operands. (Since Intel SnB and onward have two load ports, L1 cache can keep up with this.) We need to use multiple accumulators to allow parallel operation. (And then PMAX all the accumulators together at the end, before doing the horizontal operation).

;;; probably buggy, use at your own risk.  edits welcome
global max_array
max_array:
    ; function args: int *rsi, uint64_t rdi
    ; requirements: src is aligned on a 16B boundary, size is a multiple of 32bytes (8 elements), and >=8 on entry
    ; TODO: support unaligned with some startup code, and a partial final iteration with some cleanup

    lea rdx, [rsi + 4*rdi]   ; end pointer

    movdqa  xmm0, [rsi]      ; two accumulators
    movdqa  xmm1, [rsi + 16]  

    add   rsi, 32
    cmp   rsi, rdx
    jae .out        ; early exit if we shouldn't run the loop even once.  unsigned compare for addresses

.loop:
    pmaxsd  xmm0, [rsi]
    pmaxsd  xmm1, [rsi+16]
    add     rsi,  32
    cmp     rsi, rdx   ;; loop is 4 uops on Intel, since this cmp/branch macro-fuses
    jb   .loop

.out:
    ;; TODO: cleanup code to handle any non-multiple-of-8 iterations.

    pmaxsd  xmm0, xmm1
    movhlps xmm1, xmm0    ; xmm0 = { d, c, b, a}.  xmm1 = { d, c, d, c }
    pmaxsd  xmm0, xmm1    ; xmm0 = { d, c, max(d,b), max(c, a) }
                          ; if we were using AVX 3-operand instructions, we'd use PSRLDQ and another pmax because it's easy.

    ; do the final stage of horizontal MAX in integer registers, just for fun.
    ; pshufd/pmax to do the last level would be faster than this shld/cmp/cmov.

    movq    rax, xmm0     ; rax = { max(d,b), max(c,a) }
             ;  two-reg shift to unpack rax into edx:eax (with garbage in the high half of both)
    shld    rdx, rax, 32  ; rax = unchanged (eax=max(c,a)),  edx = max(d,b).
    cmp     edx, eax
    cmovg   eax, edx      ; eax = max( max(c,a), max(d,b) )
    ret

在理论上,这种运行在每个时钟一个迭代的英特尔SNB家族的微架构。 4个稠域微指令每时钟饱和管,但更多的展开(和使用更多的蓄电池)只是使清理code为这种非玩具版本更大的头痛。

In theory, this runs at one iteration per clock on Intel SnB-family microarchitectures. 4 fused-domain uops per clock saturates the pipe, but unrolling more (and using more accumulators) just makes the cleanup code for a non-toy version of this a bigger headache.

这篇关于如何比较装配数组元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆