:lower16, :upper16 for aarch64; absolute address into register

Question

I need to put a 32-bit absolute address into a register on AArch64. (e.g. an MMIO address, not PC-relative).

On ARM32 it was possible to use :lower16 and :upper16 to load an address into a register:

movw    r0, #:lower16:my_addr
movt    r0, #:upper16:my_addr

Is there a way to do a similar thing on AArch64 by using movk?

If the code is relocated, I still want the same absolute address, so adr isn't suitable.

ldr from a nearby literal pool would work, but I'd rather avoid that.

Recommended answer

If your address is an assemble-time constant, not link-time, this is super easy. It's just an integer, and you can split it up manually.

I asked gcc and clang to compile unsigned abs_addr() { return 0x12345678; } (Godbolt)

// gcc8.2 -O3
abs_addr():
    mov     w0, 0x5678               // low half
    movk    w0, 0x1234, lsl 16       // high half
    ret

(Writing w0 implicitly zero-extends into 64-bit x0, same as x86-64).
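
To make the "split it up manually" step concrete, here is a small C sketch that prints the same kind of sequence gcc emitted above: movz for the lowest non-zero 16-bit chunk, then movk for each remaining non-zero chunk, going from low to high. The function name split_constant and its output format are made up for illustration; it assumes the constant is fully known when you run it (the assemble-time case), not a link-time symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a movz/movk sequence that builds `value` in
 * register `reg` ("x0", "w1", ...), lowest 16-bit chunk first, one
 * instruction per line. Returns the number of instructions written,
 * or -1 if `out` is too small. */
static int split_constant(uint64_t value, const char *reg, char *out, size_t outsz)
{
    int chunks = (reg[0] == 'w') ? 2 : 4;    /* w regs: 32 bits, x regs: 64 */
    assert(chunks == 4 || value <= 0xFFFFFFFFu);
    if (outsz == 0)
        return -1;

    size_t len = 0;
    int n = 0;
    out[0] = '\0';
    for (int i = 0; i < chunks; i++) {
        unsigned chunk = (unsigned)((value >> (16 * i)) & 0xFFFF);
        if (chunk == 0)
            continue;                        /* movz already cleared these bits */
        const char *op = (n == 0) ? "movz" : "movk";
        int w = (i == 0)
            ? snprintf(out + len, outsz - len, "%s %s, #0x%x\n", op, reg, chunk)
            : snprintf(out + len, outsz - len, "%s %s, #0x%x, lsl #%d\n",
                       op, reg, chunk, 16 * i);
        if (w < 0 || (size_t)w >= outsz - len)
            return -1;                       /* output truncated */
        len += (size_t)w;
        n++;
    }
    if (n == 0) {                            /* value == 0: a single movz #0 */
        int w = snprintf(out, outsz, "movz %s, #0x0\n", reg);
        if (w < 0 || (size_t)w >= outsz)
            return -1;
        n = 1;
    }
    return n;
}
```

For 0x12345678 in w0 this yields "movz w0, #0x5678" followed by "movk w0, #0x1234, lsl #16", the same pair gcc chose. Zero chunks can be skipped because movz has already cleared them.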

Or if your constant is only a link-time constant and you need to generate relocations in the .o for the linker to fill in, the GAS manual documents what you can do, in the AArch64 machine-specific section:

Relocations for ‘MOVZ’ and ‘MOVK’ instructions can be generated by prefixing the label with #:abs_g2: etc. For example to load the 48-bit absolute address of foo into x0:

    movz x0, #:abs_g2:foo     // bits 32-47, overflow check
    movk x0, #:abs_g1_nc:foo  // bits 16-31, no overflow check
    movk x0, #:abs_g0_nc:foo  // bits  0-15, no overflow check

The GAS manual's example is sub-optimal; going low to high is more efficient on at least some AArch64 CPUs (see below). For a 32-bit constant, follow the same pattern that gcc used for a numeric literal.

 movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
 movk x0, #:abs_g1:foo              // bits 16-31, overflow check

#:abs_g1:foo is known to have its possibly-set bits in the 16-31 range, so the assembler knows to use lsl 16 when encoding movk. You should not use an explicit lsl 16 here.

I chose x0 instead of w0 because that's what gcc does for unsigned long long. Probably performance is identical on all CPUs, and code size is identical.

.text
func:
   // efficient
     movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
     movk x0, #:abs_g1:foo              // bits 16-31, overflow check

   // inefficient but does assemble + link
   //  movz x1, #:abs_g1:foo              // bits 16-31, overflow check
   //  movk x1, #:abs_g0_nc:foo           // bits  0-15, no overflow check

.data
foo: .word 123       // .data will be in a different page than .text

With GCC: aarch64-linux-gnu-gcc -nostdlib aarch-reloc.s to build and link (just to prove we can; it would crash if you actually ran it), then aarch64-linux-gnu-objdump -drwC a.out:

a.out:     file format elf64-littleaarch64


Disassembly of section .text:

000000000040010c <func>:
  40010c:       d2802280        mov     x0, #0x114                      // #276
  400110:       f2a00820        movk    x0, #0x41, lsl #16
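
As a cross-check on the shift the assembler derives from #:abs_g1:foo, here is a small C encoder for the 64-bit MOVZ/MOVK forms, using the move-wide field layout sf | opc | 100101 | hw | imm16 | Rd. The macro and function names are hypothetical. A shift of 16 becomes hw = 1, which is the lsl #16 shown in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit (sf = 1) move-wide opcodes; bits 28-23 are 100101. */
#define MOVZ_X 0xD2800000u   /* opc = 10 */
#define MOVK_X 0xF2800000u   /* opc = 11 */

/* Encode movz/movk Xd, #imm16, lsl #shift. */
static uint32_t encode_movw(uint32_t base, unsigned rd, uint16_t imm16, unsigned shift)
{
    assert(rd < 32 && shift % 16 == 0 && shift <= 48);
    uint32_t hw = shift / 16;                /* lsl #16 -> hw = 1, lsl #32 -> 2, ... */
    return base | (hw << 21) | ((uint32_t)imm16 << 5) | rd;
}
```

Encoding mov x0, #0x114 and movk x0, #0x41, lsl #16 this way reproduces d2802280 and f2a00820, and (0x41 << 16) | 0x114 = 0x410114 is the address the linker resolved foo to.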

---

Clang appears to have a bug here, making it unusable: it only assembles #:abs_g1_nc:foo (no check for the high half) and #:abs_g0:foo (overflow check for the low half). This is backwards, and results in a linker error (g0 overflow) when foo has a 32-bit address. I'm using clang version 7.0.1 on x86-64 Arch Linux.

$ clang -target aarch64 -c aarch-reloc.s
aarch-reloc.s:5:15: error: immediate must be an integer in range [0, 65535].
     movz x0, #:abs_g0_nc:foo
              ^

As a workaround, g1_nc instead of g1 is fine; you can live without that overflow check. But you need g0_nc, unless you have a linker where checking can be disabled. (Or maybe some clang installs come with a linker that's bug-compatible with the relocations clang emits?) I was testing with GNU ld (GNU Binutils) 2.31.1 and GNU gold (GNU Binutils 2.31.1) 1.16.

$ aarch64-linux-gnu-ld.bfd aarch-reloc.o 
aarch64-linux-gnu-ld.bfd: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
aarch64-linux-gnu-ld.bfd: aarch-reloc.o: in function `func':
(.text+0x0): relocation truncated to fit: R_AARCH64_MOVW_UABS_G0 against `.data'

$ aarch64-linux-gnu-ld.gold aarch-reloc.o 
aarch-reloc.o(.text+0x0): error: relocation overflow in R_AARCH64_MOVW_UABS_G0
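
Both errors come down to the range check the checked relocation types apply. Below is a minimal C model of how a linker might fill in R_AARCH64_MOVW_UABS_G0..G3; it is my own summary of the AArch64 ELF ABI rules, not code from ld or gold, and movw_uabs is a hypothetical name. Group g takes bits 16g to 16g+15 of the address X = S + A, and the non-_NC forms additionally require X < 2^(16(g+1)). The address 0x410114 is only an example value, borrowed from the GCC link above; the clang-built object gets its own .data address, but any address at or above 0x10000 fails the G0 check the same way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of filling in R_AARCH64_MOVW_UABS_G<g> (checked) or _G<g>_NC
 * (unchecked) for an absolute address x = S + A. Returns false where a
 * linker would report an overflow ("relocation truncated to fit"). */
static bool movw_uabs(uint64_t x, unsigned g, bool checked, uint16_t *imm16)
{
    assert(g <= 3);
    /* Checked forms require every bit above this group to be zero.
     * G3 covers the top 16 bits, so there is nothing left to check. */
    if (checked && g < 3 && (x >> (16 * (g + 1))) != 0)
        return false;
    *imm16 = (uint16_t)(x >> (16 * g));
    return true;
}
```

So a checked g0 rejects any address above 0xFFFF, which is why clang's #:abs_g0:foo can't link for a normal 32-bit address, while g0_nc plus a checked g1 accepts every address below 4 GiB.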

---

MOVZ vs. MOVK vs. MOVN

movz = move-zero puts a 16-bit immediate into a register with a left-shift of 0, 16, 32 or 48 (and clears the rest of the bits). You always want to start a sequence like this with a movz, and then movk the rest of the bits. (movk = move-keep. Move 16-bit immediate into register, keeping other bits unchanged.)
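
A small C model of the two instructions (64-bit forms only; the helper names movz and movk are just labels for this sketch) shows why the sequence has to start with movz: movk only replaces its own 16-bit field, so whatever was in the register beforehand survives in the other bits. The order of the chunks doesn't change the final value, only, on some CPUs, the speed.

```c
#include <assert.h>
#include <stdint.h>

/* movz: place imm16 at `shift` and zero every other bit. */
static uint64_t movz(uint16_t imm16, unsigned shift)
{
    assert(shift % 16 == 0 && shift <= 48);
    return (uint64_t)imm16 << shift;
}

/* movk: replace only the 16-bit field at `shift`, keep all other bits. */
static uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
    assert(shift % 16 == 0 && shift <= 48);
    uint64_t field = (uint64_t)0xFFFF << shift;
    return (reg & ~field) | ((uint64_t)imm16 << shift);
}
```

Low-then-high and high-then-low both give 0x12345678, but a lone movk into a register holding 0xDEADBEEFCAFEF00D leaves 0xDEADBEEFCAFE5678.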

mov is sort of a pseudo-instruction that can pick movz, but I just tested with GNU binutils and clang, and you need an explicit movz (not mov) with an immediate like #:abs_g0:foo. Apparently the assembler won't infer that it needs movz there, unlike with a numeric literal.

For a narrow immediate that has non-zero bits in two aligned 16-bit chunks of the value, e.g. 0xFF000 or 0x18000, mov w0, #0x18000 would pick the bitmask-immediate form of mov, which is actually an alias for ORR-immediate with the zero register. AArch64 bitmask immediates use a powerful encoding scheme for repeated patterns of bit-ranges. (So e.g. and x0, x1, 0x5555555555555555 (keep only the even bits) can be encoded in a single 32-bit-wide instruction, great for bit-hacks.)
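
The bitmask-immediate rule can be stated as a check: the value must be a 2-, 4-, 8-, 16-, 32- or 64-bit element repeated across the register, whose set bits form a single contiguous run after some rotation (all-zeros and all-ones are excluded). The C sketch below is my own straightforward version of that check, not the binutils implementation; is_bitmask_imm64 and is_bitmask_imm32 are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static unsigned popcount64(uint64_t v)
{
    unsigned n = 0;
    for (; v != 0; v &= v - 1)      /* clear the lowest set bit each step */
        n++;
    return n;
}

/* True if v is encodable as a 64-bit AArch64 logical ("bitmask") immediate. */
static bool is_bitmask_imm64(uint64_t v)
{
    if (v == 0 || v == ~(uint64_t)0)
        return false;               /* not representable */

    /* Shrink the element size while the value is two copies of its halves. */
    unsigned size = 64;
    while (size > 2) {
        unsigned half = size / 2;
        uint64_t half_mask = ((uint64_t)1 << half) - 1;
        if ((v & half_mask) != ((v >> half) & half_mask))
            break;
        size = half;
    }

    uint64_t mask = (size == 64) ? ~(uint64_t)0 : (((uint64_t)1 << size) - 1);
    uint64_t elem = v & mask;
    /* Rotate the element right by one within `size` bits. A single
     * (possibly wrapped-around) run of ones has exactly two 0/1 boundaries. */
    uint64_t rot = ((elem >> 1) | (elem << (size - 1))) & mask;
    return popcount64(elem ^ rot) == 2;
}

/* 32-bit (w register) immediates use element sizes up to 32, so
 * replicating the value into both halves lets the 64-bit check decide. */
static bool is_bitmask_imm32(uint32_t v)
{
    return is_bitmask_imm64(((uint64_t)v << 32) | v);
}
```

0xFF000 and 0x18000 pass the 32-bit check, so mov w0 can use the ORR form for them, while 0x12345678 fails and needs movz + movk.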

There's also movn (move not) which flips the bits. This is useful for negative values, allowing you to have all the upper bits set to 1. There's even a relocation for it, according to AArch64 relocation prefixes.
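
A C model of movn (hypothetical helper names, 64-bit and 32-bit forms) shows why it suits negative values: the inverted 16-bit field comes with every other bit already set. For example -2 is movn x0, #1, and 0xFFFFFFFFFFFF1234 is movn x0, #0xedcb. The 32-bit form only sets bits in the low half, because writing a w register zero-extends.

```c
#include <assert.h>
#include <stdint.h>

/* movn (64-bit form): Xd = ~(imm16 << shift). */
static uint64_t movn_x(uint16_t imm16, unsigned shift)
{
    assert(shift % 16 == 0 && shift <= 48);
    return ~((uint64_t)imm16 << shift);
}

/* movn (32-bit form): invert within 32 bits, then zero-extend into Xd. */
static uint64_t movn_w(uint16_t imm16, unsigned shift)
{
    assert(shift == 0 || shift == 16);
    return (uint32_t)~((uint32_t)imm16 << shift);
}
```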

Cortex-A57 Optimization Guide

4.14 Fast literal generation

Cortex-A57 r1p0 and later revisions support optimized literal generation for 32- and 64-bit code

    MOV wX, #bottom_16_bits
    MOVK wX, #top_16_bits, lsl #16

[and other examples]

... If any of these sequences appear sequentially and in the described order in program code, the two instructions can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.

The sequences include movz low16 + movk high16 into x or w registers, in that order. (And also back-to-back movk to set the high 32, again in low, high order.) According to the manual, both instructions have to use w, or both have to use x registers.
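
To make "in that order" concrete, here is a C sketch that decodes two move-wide instruction words and checks the one 32-bit pattern described above: movz with hw = 0 immediately followed by movk with hw = 1 into the same register, both w or both x. The helper is hypothetical and built only from the guide's description as summarized here; the guide lists further fusible sequences and remains the authority on the exact conditions (and applies only to r1p0 and later).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Move-wide fields: sf | opc | 100101 | hw | imm16 | Rd */
static bool mw_is_move_wide(uint32_t insn) { return ((insn >> 23) & 0x3F) == 0x25; }
static unsigned mw_sf(uint32_t insn)  { return insn >> 31; }
static unsigned mw_opc(uint32_t insn) { return (insn >> 29) & 3; }  /* 0 movn, 2 movz, 3 movk */
static unsigned mw_hw(uint32_t insn)  { return (insn >> 21) & 3; }
static unsigned mw_rd(uint32_t insn)  { return insn & 31; }

/* Low-half movz directly followed by a bits-16-31 movk, same Rd, same width. */
static bool a57_fusible_pair(uint32_t first, uint32_t second)
{
    return mw_is_move_wide(first) && mw_is_move_wide(second)
        && mw_opc(first) == 2 && mw_hw(first) == 0
        && mw_opc(second) == 3 && mw_hw(second) == 1
        && mw_rd(first) == mw_rd(second)
        && mw_sf(first) == mw_sf(second);
}
```

The objdump pair from earlier (d2802280, f2a00820) matches; the same two halves in the GAS manual's high-then-low order (movz x0, #0x41, lsl #16 then movk x0, #0x114) do not.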

Without special support, the movk would have to wait for the movz result to be ready as an input for an ALU operation to replace that 16-bit chunk. Presumably at some point in the pipeline, the 2 instructions merge into a single 32-bit immediate movz or movk, removing the dependency chain.
