Copying to arrays in NASM

Question

I have to write assembly code that copies 100 bytes in memory in a loop. I wrote it like this:

section .data
    a times 100 db 1 ;reserve 100 bytes and fill with 1
    b times 100 db 0 ;reserve 100 bytes and fill with 0

    section _start
    global _start

    _start:
    mov rsi, a ;get array a address
    mov rdi, b ;get array b address

    _for: ;start of the loop
    cmp cx, 100     ;loop
    jae _end_for        ;loop
    push cx         ;loop

    mov byte al, [rsi]  ;get one byte from array a into al
    mov byte [rdi], al  ;put one byte from al to array b
    inc rsi         ;set rsi to next byte in array a
    inc rdi         ;set rdi to next byte in array b

    pop cx          ;loop
    inc cx          ;loop
    jmp _for        ;loop

_end_for:

_end:
    mov rax, 60
    mov rdi, 0
    syscall

I'm not sure about the copying part. I read the value from the address to the register and then put it into another. That looks good to me, but I'm not sure about incrementing rsi and rdi.

Is it really enough?
I'm new to NASM and assembly, so please help :-)

Answer

"I know about rep movsb, but the task is to do it in a loop, byte by byte. I don't know whether it could be done in a better way."

If you have to loop 1 byte at a time, here's how to do that efficiently. It's worth mentioning because looping efficiently is useful for cases other than memcpy as well!

First of all, you know that your loop body should run at least once, so you can use a normal loop structure with a conditional branch at the bottom. (See the related Q&A "Why are loops always compiled into "do...while" style (tail jump)?")

Second, if you're not going to unroll at all then you should use an indexed addressing mode to avoid having to increment both pointers. (But really it would be better to unroll).

And don't use 16-bit registers if you don't have to. Prefer 32-bit operand-size (ECX); writing a 32-bit register implicitly zero-extends to 64-bit, so it's safe to use the full 64-bit register (RCX) as an index in an addressing mode.
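The three points above (branch at the bottom, indexed addressing, 32-bit counter) can be combined into a minimal sketch like the following. It assumes x86-64 Linux, the same a and b arrays as in the question, and a nonzero ARR_SIZE; the label name .copy_loop is just an illustration.

```nasm
default rel                     ; RIP-relative addressing for [a] and [b]

ARR_SIZE equ 100                ; assumed nonzero: the loop body always runs once

section .data
    a:  times ARR_SIZE db 1     ; source array, filled with 1

section .bss
    b:  resb ARR_SIZE           ; destination array, zero-initialized

section .text
global _start
_start:
    lea     rsi, [a]            ; src base
    lea     rdi, [b]            ; dst base
    xor     ecx, ecx            ; index = 0 (32-bit write zero-extends into RCX)

.copy_loop:                     ; do {
    movzx   eax, byte [rsi+rcx] ;   load src[index]
    mov     [rdi+rcx], al       ;   store to dst[index]
    inc     ecx                 ;   one increment instead of two pointer increments
    cmp     ecx, ARR_SIZE
    jb      .copy_loop          ; } while (++index < ARR_SIZE);

    mov     eax, 60             ; sys_exit
    xor     edi, edi            ; status 0
    syscall
```

Because every write to ECX clears the upper half of RCX, using RCX in [rsi+rcx] is safe even though only the 32-bit register is ever written.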

You can use an indexed load but a non-indexed store, so your store-address uops can still run on port 7. That makes the loop slightly more hyperthreading-friendly on Haswell/Skylake and avoids un-lamination on Sandybridge. Obviously copying 1 byte at a time is total garbage for performance, but sometimes you do want to loop and actually do something with each byte while it's in a register, and you can't manually vectorize it with SSE2 (to do 16 bytes at a time).

You can do this by indexing the src relative to the dst.

Or the other trick is to count a negative index up towards zero, so you avoid an extra cmp. Let's do that first:

default rel       ; use RIP-relative addressing modes by default

ARR_SIZE  equ 100
section .data
    a:  times ARR_SIZE db 1

section .bss
    b:  resb ARR_SIZE       ;reserve n bytes of space in the BSS

    ;section _start   ; do *not* use custom section names unless you have a good reason
                      ; they might get linked with unexpected read/write/exec permission

section .text
global _start
_start:
    lea     rsi, [a+ARR_SIZE]   ; pointers to one-past-the-end of the arrays
    lea     rdi, [b+ARR_SIZE]   ; RIP-relative LEA is better than mov r64, imm64

    mov     rcx, -ARR_SIZE

.copy_loop:                 ; do {
    movzx   eax, byte [rsi+rcx]  ; load without a false dependency on the old value of RAX
    mov     [rdi+rcx], al
    inc     rcx
    jnz    .copy_loop       ; }while(++idx != 0);

.end:
    mov  eax, 60
    xor  edi, edi
    syscall             ; sys_exit(0)

In position-dependent code like a static (or other non-PIE) Linux executable, mov edi, b+ARR_SIZE is the most efficient way to put a static address into a register.
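As a rough sketch of that position-dependent variant, the negative-index loop can load its one-past-the-end pointers with mov r32, imm32. This assumes the program is linked as a non-PIE static executable (for example with plain ld), so that the addresses of a and b fit in 32 bits; a PIE link would reject these absolute relocations.

```nasm
ARR_SIZE equ 100

section .data
    a:  times ARR_SIZE db 1

section .bss
    b:  resb ARR_SIZE

section .text
global _start
_start:
    mov     esi, a+ARR_SIZE     ; 5-byte mov r32, imm32; zero-extends into RSI
    mov     edi, b+ARR_SIZE     ; only valid when the address fits in 32 bits
    mov     rcx, -ARR_SIZE      ; negative index counting up towards zero

.copy_loop:                     ; do {
    movzx   eax, byte [rsi+rcx]
    mov     [rdi+rcx], al
    inc     rcx
    jnz     .copy_loop          ; } while (++idx != 0);

    mov     eax, 60             ; sys_exit
    xor     edi, edi
    syscall
```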

Don't use _ for all your label names. _start is named that way because C symbol names that begin with _ are reserved for use by the implementation. It's not something you should copy; in fact the opposite is true.

Use .foo for a local label name inside a function. e.g. .foo: is shorthand for _start.foo: if you use it after _start.
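To illustrate how local labels are scoped, here is a small sketch in which two hypothetical helper functions (fill_ones and copy_bytes, names invented for this example) each use their own .loop label without clashing. Both helpers assume a nonzero count in ECX.

```nasm
default rel

ARR_SIZE equ 100

section .bss
    a:  resb ARR_SIZE
    b:  resb ARR_SIZE

section .text
global _start

fill_ones:                      ; hypothetical helper: rdi = dst, ecx = count (nonzero)
.loop:                          ; full name is fill_ones.loop
    mov     byte [rdi], 1
    inc     rdi
    dec     ecx
    jnz     .loop
    ret

copy_bytes:                     ; hypothetical helper: rdi = dst, rsi = src, ecx = count (nonzero)
.loop:                          ; full name is copy_bytes.loop, no clash with the one above
    movzx   eax, byte [rsi]
    mov     [rdi], al
    inc     rsi
    inc     rdi
    dec     ecx
    jnz     .loop
    ret

_start:
    lea     rdi, [a]
    mov     ecx, ARR_SIZE
    call    fill_ones           ; a[] = 1, 1, 1, ...

    lea     rsi, [a]
    lea     rdi, [b]
    mov     ecx, ARR_SIZE
    call    copy_bytes          ; b[] = a[]

.exit:                          ; full name is _start.exit
    mov     eax, 60             ; sys_exit
    xor     edi, edi
    syscall
```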

Indexing the src relative to the dst:

Normally your input and output aren't both in static storage, so you have to sub the addresses at runtime. Here, if we put them both in the same section like you were originally doing, mov rcx, a-b will actually assemble. But if not, NASM refuses.

In fact instead of a 2-register addressing mode, I could just be doing [rdi + (a-b)], or simply [rdi - ARR_SIZE] because I know they're contiguous.

_start:
    lea     rdi, [b]   ; RIP-relative LEA is better than mov r64, imm64
    mov     rcx, a-b   ; distance between arrays so  [rdi+rcx] = [a]
;;; for a-b to assemble, I had to move b back to the .data section.

    lea     rdx, [rdi+ARR_SIZE]    ; end_dst pointer

.copy_loop:                 ; do {
    movzx   eax, byte [rdi + rcx]    ; src = dst+(src-dst)
    mov     [rdi], al
    inc     rdi

    cmp     rdi, rdx
    jb     .copy_loop       ; }while(dst < end_dst);

The end-of-array pointer is a one-past-the-end pointer, exactly like the pointer / iterator you'd get in C++ from foo.end().
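The single-register form mentioned earlier, [rdi - ARR_SIZE], can be sketched as a complete program as follows. It relies on the assumption that b is placed directly after a in the same section, so that a-b equals -ARR_SIZE; if the layout changes, the displacement is wrong.

```nasm
default rel

ARR_SIZE equ 100

section .data
    a:  times ARR_SIZE db 1     ; source
    b:  times ARR_SIZE db 0     ; destination, must directly follow a

section .text
global _start
_start:
    lea     rdi, [b]                    ; dst pointer
    lea     rdx, [rdi+ARR_SIZE]         ; end_dst: one past the end of b

.copy_loop:                             ; do {
    movzx   eax, byte [rdi - ARR_SIZE]  ;   src = dst + (a-b), a constant displacement
    mov     [rdi], al                   ;   non-indexed store
    inc     rdi
    cmp     rdi, rdx
    jb      .copy_loop                  ; } while (dst < end_dst);

    mov     eax, 60                     ; sys_exit
    xor     edi, edi
    syscall
```

Both the load and the store use a single base register here, so neither needs an indexed addressing mode.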

This needs INC + CMP/JCC as loop overhead. On AMD CPUs, CMP/JCC can macro-fuse into 1 uop but INC/JCC can't, so the extra CMP vs. indexing from the end is basically free. (Except for code-size).

On Intel this avoids an indexed store. The load is a pure load in this case, so it's a single uop anyway without needing to stay micro-fused with an ALU uop. Intel can macro-fuse inc/jcc so this does cost an extra uop of loop overhead.

This way of looping is good if you're unrolling, as long as you don't need to avoid an indexed addressing mode for loads. But if you're using a memory source for an ALU instruction like vaddps ymm0, ymm1, [rdi], then yes, you should increment both pointers separately so you can use non-indexed addressing modes for both loads and stores, because Intel CPUs are more efficient that way. (The port 7 store AGU handles non-indexed addressing only, and some micro-fused loads un-laminate with an indexed addressing mode; see the related Q&A "Micro fusion and addressing modes".)
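For comparison, a byte-at-a-time loop that increments both pointers and is unrolled by 4 might look like the sketch below. The unroll factor is an arbitrary choice for illustration, and the code assumes ARR_SIZE is a nonzero multiple of 4 (the %if check enforces the multiple-of-4 part at assembly time).

```nasm
default rel

ARR_SIZE equ 100

%if (ARR_SIZE & 3) != 0
  %error "this sketch assumes ARR_SIZE is a multiple of 4"
%endif

section .data
    a:  times ARR_SIZE db 1

section .bss
    b:  resb ARR_SIZE

section .text
global _start
_start:
    lea     rsi, [a]                ; src pointer
    lea     rdi, [b]                ; dst pointer
    lea     rdx, [rdi+ARR_SIZE]     ; end_dst: one past the end of b

.copy_loop:                         ; do {
    movzx   eax, byte [rsi]         ;   base + constant displacement only:
    mov     [rdi], al               ;   no index register in any load or store
    movzx   eax, byte [rsi+1]
    mov     [rdi+1], al
    movzx   eax, byte [rsi+2]
    mov     [rdi+2], al
    movzx   eax, byte [rsi+3]
    mov     [rdi+3], al
    add     rsi, 4                  ;   pointer updates amortized over 4 bytes
    add     rdi, 4
    cmp     rdi, rdx
    jb      .copy_loop              ; } while (dst < end_dst);

    mov     eax, 60                 ; sys_exit
    xor     edi, edi
    syscall
```

With unrolling, the cost of the two pointer increments and the cmp/jcc is spread over four copies, which is why separate pointers stop being a disadvantage.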
