Differentiate data from instructions in ARM

Question

In (32-bit) ARM Linux kernels, how can data embedded in the code section be differentiated from instructions?

A lightweight approach that is easy to implement, such as bit masks, would be best. Embedding a disassembler in the kernel is not a sensible option.

Recommended answer

In general, what you're asking for is impossible.

Consider this function, which happens to use a data value too big to encode as an immediate:

@ void patch_nop(void *code_addr);
patch_nop:
    ldr r1, =0xe1a00000
    str r1, [r0]
    bx lr

By the time it has been through an assembler and back, it looks like this:

$ arm-none-eabi-objdump -d a.out

a.out:     file format elf32-littlearm


Disassembly of section .text:

    00000000 <patch_nop>:
       0:   e59f1004        ldr     r1, [pc, #4]    ; c <patch_nop+0xc>
       4:   e5801000        str     r1, [r0]
       8:   e12fff1e        bx      lr
       c:   e1a00000        .word   0xe1a00000

Thanks to the ELF data, we can still ascertain where the function ends and the literal pool begins, but the work objdump does to dig through the sections and symbols is hardly "lightweight", and who says you have those anyway? What if you have just the code?

$ arm-none-eabi-objcopy -Obinary a.out bin
$ arm-none-eabi-objdump -D -marm -bbinary bin

bin:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   e59f1004        ldr     r1, [pc, #4]    ; 0xc
   4:   e5801000        str     r1, [r0]
   8:   e12fff1e        bx      lr
   c:   e1a00000        nop                     ; (mov r0, r0)

There. Embedded in your instruction stream, you have data which is an instruction. Not data which merely happens to look like an instruction, but a genuine instruction encoding being used as data. There is literally nothing you can take from those 32 bits alone to infer that they are not going to be executed (well, not from that location at least).

There are a few heuristics which might help you make an educated guess, particularly if some additional prior knowledge can be assumed to narrow things down:

  • Anything which can be encoded as an immediate is almost certainly an instruction, because a compiler/assembler wouldn't have emitted it as a literal in the first place. However, you'd ideally want to know at least whether the preceding code is ARM or Thumb in order to know what the appropriate immediate range is*.

  • Anything which is an undefined instruction is usually going to be data, unless it happens to be code which intentionally raises an undef exception. You essentially need most of a disassembler to check that something doesn't match any defined encoding, and that is on top of the ARM/Thumb question.

  • Anything immediately following an unconditional branch might be literal data, particularly if you have symbols and can tell it's very close to the start of the following function, or if you have some knowledge of the data you're looking for and it looks like that data. The latter point is certainly relevant if you're just eyeballing disassembly - in practice literal data tends to be stuff like addresses, which generally stand out like a sore thumb† once you look at the code as a whole.
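The first heuristic can be sketched as a small check. This is a minimal illustration in Python (chosen for brevity; the same shifts and masks carry over directly to C), and it assumes ARM-state code only: it tests whether a 32-bit value fits the A32 "modified immediate" form, an 8-bit constant rotated right by an even amount. The function names are hypothetical, and Thumb-2 immediates follow different rules that are not covered here.

```python
def is_arm_immediate(value: int) -> bool:
    """True if value is an 8-bit constant rotated right by an even
    amount (0, 2, ..., 30), i.e. an A32 modified immediate."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate-right by rot.
        undone = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if undone <= 0xFF:
            return True
    return False


def fits_mov_or_mvn(value: int) -> bool:
    """True if a single MOV or MVN with an immediate could produce value,
    in which case an assembler has no need to place it in a literal pool."""
    return is_arm_immediate(value) or is_arm_immediate(~value & 0xFFFFFFFF)
```

A word for which `fits_mov_or_mvn` returns true is unlikely to be a pool entry generated for `ldr rX, =value`; the NOP encoding 0xe1a00000 from the example above fails the check, which is why the assembler emitted it as a literal. On cores that have MOVW, any 16-bit value can also be loaded without a literal, so the test would need widening there.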

The most reliable way to check whether something is a literal is to look through the preceding code (up to 1025 instructions away) for a PC-relative load targeting that address. You'd only need to check against literal load encodings (there's your simple bitmasking operation), then decode the relative offset if you find one. Ideally you'd want to solve the ARM/Thumb problem first to avoid false positives from checking against inappropriate encodings, and in the most absolutely pathological case you could still run into some data in a preceding literal pool which happens to look like a literal load targeting your address; never say never.
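A rough sketch of that scan, again in Python for brevity. It assumes little-endian ARM-state code already split into 32-bit words, and it only recognises the plain word form `LDR Rt, [pc, #+/-imm12]`; byte, halfword, doubleword and VFP literal loads, and all Thumb encodings, are left out. The constant and function names are made up for this illustration.

```python
# LDR Rt, [pc, #+/-imm12]: bits 27-25 = 010, P = 1, B = 0, W = 0, L = 1, Rn = 1111.
# The U bit (23) selects add or subtract, so it is excluded from the mask.
LDR_LIT_MASK = 0x0F7F0000
LDR_LIT_BITS = 0x051F0000


def literal_target(insn: int, addr: int):
    """Return the address read by an A32 literal load located at addr,
    or None if insn does not match that encoding."""
    if (insn >> 28) == 0xF:
        return None  # condition 0b1111 is the unconditional space, not an LDR
    if (insn & LDR_LIT_MASK) != LDR_LIT_BITS:
        return None
    offset = insn & 0xFFF
    if not (insn & (1 << 23)):
        offset = -offset  # U bit clear: the offset is subtracted
    return addr + 8 + offset  # in ARM state the PC reads as addr + 8


def loads_targeting(words, base: int, target: int):
    """Addresses of literal loads in words (32-bit ints, first one at base)
    whose decoded target equals target."""
    hits = []
    for i, insn in enumerate(words):
        addr = base + 4 * i
        if literal_target(insn, addr) == target:
            hits.append(addr)
    return hits
```

Fed the four words of `patch_nop` from the dump above with `base=0`, `loads_targeting(words, 0, 0xc)` reports the load at address 0, marking the word at 0xc as a probable literal. In practice the scan would be limited to the window of code that can actually reach the address in question.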

And of course, all of that still assumes literal pools emitted automatically by a compiler/assembler; when it comes to entirely handwritten assembly code, all bets are off:

patch_nop2:
    ldr r1, [pc, #-4]
    mov r0, r0
    str r1, [r0]
    bx lr

Is it code? Yes. Is it data? Yes.

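* Incidentally, distinguishing between ARM and Thumb code boils down to essentially the same problem as this question - "what does this bit pattern mean?" - and is equally non-trivial without external help.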

† No pun intended.
