How to MOVe 3 bytes (24 bits) from memory to a register?


Question

I can move a data item stored in memory into a general-purpose register of my choice using the MOV instruction:

MOV r8, [m8]
MOV r16, [m16]
MOV r32, [m32]
MOV r64, [m64]

Now, don't shoot me, but how do I achieve the following: MOV r24, [m24]? (I appreciate that the latter isn't legal.)

In my example, I want to move the characters "Pip" (i.e. 0x706950) to register rax.

section .data           ; Section containing initialized data

DogsName:    db "PippaChips"
DogsNameLen: equ $-DogsName

I first considered that I could move the bytes separately, i.e. first a byte, then a word, or some combination thereof. However, I cannot reference the "top halves" of eax/rax, so this falls down at the first hurdle, as I would end up overwriting whatever data was moved first.

My solution:

mov al, byte [DogsName + 2] ; move the character "p" to register al
shl rax, 16                 ; shift bits left by 16, clearing ax to receive characters "pi"
mov ax, word [DogsName]     ; move the characters "Pi" to register ax

I could just declare "Pip" as an initialized data item, but the example is just that, an example; I want to understand how to reference 24 bits in assembly, or 40, 48… for that matter.

Is there an instruction more akin to MOV r24, [m24]? Is there a way to select a range of memory addresses, as opposed to providing the offset and specifying a size operator? How do I move 3 bytes from memory to a register in ASM x86_64?

NASM version 2.11.08, x86 architecture

Solution

Normally you'd do a 4-byte load and mask off the high garbage that came with the bytes you wanted, or simply ignore it if you're doing something with the data that doesn't care about high bits. (See: Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?)


Unlike stores¹, loading data that you "shouldn't" is never a problem for correctness unless you cross into an unmapped page. (E.g. if db "pip" came at the end of a page, and the following page was unmapped.) But in this case, you know it's part of a longer string, so the only possible downside is performance if a wide load extends into the next cache line (so the load crosses a cache-line boundary). (See: Is it safe to read past the end of a buffer within the same page on x86 and x64?)

Either the byte before or the byte after will always be safe to access, for any 3 bytes (without even crossing a cache-line boundary if the 3 bytes themselves weren't split between two cache lines). Figuring this out at run-time is probably not worth it, but if you know the alignment at compile time, you can do either

mov   eax, [DogsName-1]     ; if previous byte is in the same page/cache line
shr   eax, 8

mov   eax, [DogsName]       ; if following byte is in the same page/cache line
and   eax, 0x00FFFFFF

I'm assuming you want to zero-extend the result into eax/rax, like 32-bit operand-size, instead of merging with the existing high byte(s) of EAX/RAX like 8 or 16-bit operand-size register writes. If you do want to merge, mask the old value and OR. Or if you loaded from [DogsName-1] so the bytes you want are in the top 3 positions of EAX, and you want to merge into ECX: shr ecx, 24 / shld ecx, eax, 24 to shift the old top byte down to the bottom, then shift it back while shifting in the 3 new bytes. (There's no memory-source form of shld, unfortunately. Semi-related: efficiently loading from two separate dwords into a qword.) shld is fast on Intel CPUs (especially Sandybridge and later: 1 uop), but not on AMD (http://agner.org/optimize/).
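
Written out, that merge might look like the following sketch (under the same assumption as above that the byte before DogsName is safe to read; the result keeps ECX's old top byte and replaces its low 3 bytes):

mov   eax, [DogsName-1]    ; the 3 wanted bytes land in the top 3 bytes of EAX
shr   ecx, 24              ; move ECX's old top byte down to the bottom
shld  ecx, eax, 24         ; shift it back up while shifting in the 3 new bytes from EAX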


Combining 2 separate loads

There are many ways to do this, but there's no single fastest way across all CPUs, unfortunately. Partial-register writes behave differently on different CPUs. Your way (byte load / shift / word-load into ax) is fairly good on CPUs other than Core2/Nehalem (which will stall to insert a merging uop when you read eax after assembling it). But start with movzx eax, byte [DogsName + 2] to break the dependency on the old value of rax.

The classic "safe everywhere" code that you'd expect a compiler to generate would be:

DEFAULT REL      ; compilers use RIP-relative addressing for static data; you should too.
movzx   eax, byte [DogsName + 2]   ; avoid false dependency on old EAX
movzx   ecx, word [DogsName]
shl     eax, 16
or      eax, ecx

This takes an extra instruction, but avoids writing any partial registers. However, on CPUs other than Core2 or Nehalem, the best option for 2 loads is writing ax. (Intel P6 before Core2 can't run x86-64 code, and CPUs without partial-register renaming will merge into rax when writing ax). Sandybridge does still rename AX, but the merge only costs 1 uop with no stalling, i.e. same as the OR, but on Core2/Nehalem the front-end stalls for about 3 cycles while inserting the merge uop.

Ivybridge and later only rename AH, not AX or AL, so on those CPUs, the load into AX is a micro-fused load+merge. Agner Fog doesn't list an extra penalty for mov r16, m on Silvermont or Ryzen (or any other tabs in the spreadsheet I looked at), so presumably other CPUs without partial-reg renaming also execute mov ax, [mem] as a load+merge.

movzx   eax, byte [DogsName + 2]
shl     eax, 16
mov      ax, word [DogsName]

; using eax: 
  ; Sandybridge: extra 1 uop inserted to merge
  ; core2 / nehalem: ~3 cycle stall (unless you don't use it until after the load retires)
  ; everything else: no penalty


Actually, testing alignment at run-time can be done efficiently. Given a pointer in a register, the previous byte is in the same cache line unless the last 5 or 6 bits of the address are all zero (i.e. the address is aligned to the start of a cache line). Let's assume cache lines are 64 bytes; all current CPUs use that, and I don't think any x86-64 CPUs with 32-byte lines exist. (And we still definitely avoid page-crossing.)

    ; pointer to m24 in RSI
    ; result: EAX = zero_extend(m24)

    test   sil, 111111b     ; test all 6 low bits.  There's no TEST r32, imm8, so  REX r8, imm8 is shorter and never slower.
    jz   .aligned_by_64

    mov    eax, [rsi-1]
    shr    eax, 8
.loaded:

    ...
    ret    ; end of whatever large function this is part of

 ; unlikely block placed out-of-line to keep the common case fast
.aligned_by_64:
    mov    eax, [rsi]
    and    eax, 0x00FFFFFF
    jmp   .loaded

So in the common case, the extra cost is only one not-taken test-and-branch uop.

Depending on the CPU, the inputs, and the surrounding code, testing only the low 12 bits (i.e. only avoiding 4k page crossings) buys better branch prediction at the cost of some cache-line-split loads within pages, but still never a load that touches a potentially-unmapped page. (In that case, use test esi, (1<<12)-1. Unlike testing sil with an imm8, testing si with an imm16 is not worth the LCP stall on Intel CPUs to save 1 byte of code. And of course, if you can have your pointer in ra/b/c/dx, you don't need a REX prefix, and there's even a compact 2-byte encoding for test al, imm8.)
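
As a sketch, that variant has the same structure as the branchy version above with only the test swapped; the label names here are illustrative:

    test   esi, (1<<12)-1   ; low 12 bits all zero <=> RSI is the first byte of a page
    jz     .page_start      ; rare case: must not touch [rsi-1]

    mov    eax, [rsi-1]     ; may be a cache-line split, but never a page split
    shr    eax, 8
.have_m24:
    ...

.page_start:
    mov    eax, [rsi]
    and    eax, 0x00FFFFFF
    jmp    .have_m24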

You could even do this branchlessly, but clearly not worth it vs. just doing 2 separate loads!

    ; pointer to m24 in RSI
    ; result: EAX = zero_extend(m24)

    xor    ecx, ecx
    test   sil, 7         ; might as well keep it within a qword if  we're not branching
    setnz  cl             ; ecx = (not_start_of_line) ? 1 : 0

    sub    rsi, rcx       ; normally rsi-1
    mov    eax, [rsi]

    shl    ecx, 3         ; cl = 8 : 0
    shr    eax, cl        ; eax >>= 8  : eax >>= 0

                          ; with BMI2:  shrx eax, [rsi], ecx  is more efficient

    and    eax, 0x00FFFFFF  ; mask off to handle the case where we didn't shift.
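
With BMI2, the load and the variable-count shift fold into shrx, as the comment above suggests. A minimal sketch of that variant (untested, same addressing trick):

    xor    ecx, ecx
    test   sil, 7
    setnz  cl                     ; cl = 1 if backing up one byte stays inside the qword
    sub    rsi, rcx               ; rsi -= 0 or 1
    shl    ecx, 3                 ; ecx = 0 or 8 (bit shift count)
    shrx   eax, dword [rsi], ecx  ; load + shift; count from ECX, doesn't touch FLAGS
    and    eax, 0x00FFFFFF        ; clear the unwanted 4th byte (a no-op after the >>8 case)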


True architectural 24-bit load or store

Architecturally, x86 has no 24-bit loads or stores with an integer register as the destination or source. As Brandon points out, MMX / SSE masked stores (like MASKMOVDQU, not to be confused with pmovmskb eax, xmm0) can store 24 bits from an MMX or XMM reg, given a vector mask with only the low 3 bytes set. But they're almost never useful because they're slow and always have an NT hint (so they write around the cache, and force eviction like movntdq). (The AVX dword/qword masked load/store instructions don't imply NT, but aren't available with byte granularity.)
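
For illustration only, such a 3-byte masked store might look like this sketch (the mask constant and label names are made up here; remember it's an NT store, so it bypasses the cache):

section .rodata
align 16
byte_mask_3: db 0x80, 0x80, 0x80, 0,0,0,0,0, 0,0,0,0,0, 0,0,0  ; high bit of each mask byte selects that byte

section .text
    ; EAX = 24-bit value to store, RDI = destination (maskmovdqu stores to [rdi] implicitly)
    movd        xmm0, eax
    movdqa      xmm1, [rel byte_mask_3]
    maskmovdqu  xmm0, xmm1          ; NT byte-masked store of the low 3 bytes to [rdi]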

AVX512BW (Skylake-server) adds vmovdqu8, which gives you byte-masking for loads and stores with fault-suppression for bytes that are masked off. (I.e. you won't segfault if the 16-byte load includes bytes in an unmapped page, as long as the mask bits for those bytes aren't set. But that does cause a big slowdown.) So microarchitecturally it's still a 16-byte load, but the effect on architectural state (i.e. everything except performance) is exactly that of a true 3-byte load/store (with the right mask).

You can use it on XMM, YMM, or ZMM registers.

;; probably slower than the integer way, especially if you don't actually want the result in a vector
mov       eax, 7                  ; low 3 bits set
kmovw     k1, eax                 ; hoist the mask setup out of a loop


; load:  leave out the {z} to merge into the old xmm0 (or ymm0 / zmm0)
vmovdqu8  xmm0{k1}{z}, [rsi]    ; {z}ero-masked 16-byte load into xmm0 (with fault-suppression)
vmovd     eax, xmm0

; store
vmovd     xmm0, eax
vmovdqu8  [rsi]{k1}, xmm0       ; merge-masked 16-byte store (with fault-suppression)

This assembles with NASM 2.13.01; IDK if your NASM is new enough to support AVX512. You can play with AVX512 without hardware using Intel's Software Development Emulator (SDE).

This looks cool because it's only 2 uops to get a result into eax (once the mask is set up). (However, http://instlatx64.atw.hu/'s spreadsheet of data from IACA for Skylake-X doesn't include vmovdqu8 with a mask, only the unmasked forms. Those do indicate that it's still a single-uop load, or a micro-fused store just like a regular vmovdqu/a.)

But beware of slowdowns if a 16-byte load would have faulted or crossed a cache-line boundary. I think it internally does do the load and then discards the bytes, with a potentially-expensive special case if a fault needs to be suppressed.

Also, for the store version, beware that masked stores don't forward as efficiently to loads. (See Intel's optimization manual for more).


Footnotes:

  1. Wide stores are a problem because even if you replace the old value, you'd be doing a non-atomic read-modify-write, which could break things if that byte you put back was a lock, for example. Don't store outside of objects unless you know what comes next and that it's safe, e.g. padding that you put there to allow this. You could lock cmpxchg a modified 4-byte value into place, to make sure you're not stepping on another thread's update of the extra byte, but obviously doing 2 separate stores is much better for performance than an atomic cmpxchg retry loop.
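
A minimal sketch of that lock cmpxchg retry loop (the register roles and the 4-byte alignment assumption are mine, not from the answer; as noted, two separate plain stores are normally the better choice):

    ; EDX = new 24-bit value (high byte zero), RDI = 4-byte-aligned destination
    mov     eax, [rdi]          ; expected old dword
.retry:
    mov     ecx, eax
    and     ecx, 0xFF000000     ; preserve whatever the 4th byte currently is
    or      ecx, edx            ; splice in the 3 new bytes
    lock cmpxchg [rdi], ecx     ; if [rdi]==EAX: store ECX; else reload EAX from [rdi]
    jnz     .retry              ; another thread changed the dword; retry with fresh EAX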
