Why are disassembled data becoming instructions?


Problem description

I need some help understanding what happens when this fragment of code executes: "jmp Begin". I only understand that a .com file can be at most 64 KB, so you want to put everything in one segment, and that you need a jmp if you want to put variables there. But when I search for it, many guides just say in a comment that jmp Begin is only there to skip the data and nothing else. So here is my question: what exactly happens at this moment?

It appears that it runs this

        mov     al, a
        mov     bl, b
        sub     al, bl

But I can't understand why it looks like this in Turbo Debugger. When I change the starting value of Result from ? to something greater than 0, it changes to something else, and when I change it to 90, for example, it looks completely normal. I am completely new to assembly and I can't seem to grasp it at all. Here is my whole code:

            .MODEL TINY

Code        SEGMENT

            ORG    100h
            ASSUME CS:Code, DS:Code

Start:
                jmp     Begin
a               EQU     20
b               EQU     10
c               EQU     100
d               EQU     5
Result          DB      ?


Begin:

            mov     al, a
            mov     bl, b
            sub     al, bl
            mov     bl, c
            mul     bl
            mov     bl, d
            div     bl              
            mov     Result, al
            mov     ah, 4ch
            int     21h

Code        ENDS
            END             Start

Solution

I'll try to give you an explanation.

The problem is that in the old days (and this is partly still true today) processors didn't differentiate between code bytes and data bytes in memory. This means that any byte in your .com file can be used as both code and data. The debugger has no clue which bytes will be executed as code and which will be used as data. In tricky cases a byte can actually be used as both. Your program can also create data in memory that is valid as code, and you can jump to it to execute it.

In many (but not all) cases a debugger could actually work out what is code and what is data, but this analysis can get very complex, so most debuggers/disassemblers simply don't have such a code flow analyzer. Instead, they pick an offset in your file/memory (usually the current instruction pointer) and, starting from that offset, decode consecutive bytes as assembly instructions one after another, without following any jmp instructions, until the debugger screen is filled with disassembled lines. Dumb disassemblers/debuggers don't care whether the disassembled bytes are actually used as instructions or as data in your program; they treat them all as instructions.

If you are debugging your program and the debugger stops at a breakpoint, it takes the current instruction pointer and performs a dumb disassembly again, starting from that offset, with the same primitive "fill the debugger screen" method.

This serial disassembly of consecutive bytes is a simple method that works most of the time. If you serially decode non-jmp instructions that follow each other, then you can be almost sure that the processor will execute them in this order. However, once you reach and decode a jmp instruction, you can't be sure that the following bytes are valid code. You can still try to decode them as instructions, hoping that no data is mixed into the middle of the code (and yes, in most cases there is no data after a jmp or similar control flow instruction, which is why debuggers give you a dumb disassembly as a "possibly useful prediction"). In fact, most code is full of conditional jumps, and disassembling the bytes after them as code is a very useful help from the debugger. Having data in the middle of the code after a jump instruction is quite rare, so we can treat it as an edge case.

Let's assume that you have a simple .com program that just jumps over some data and then exits with an int 20h:

    jmp start
    db  90h
start:
    int 20h

Disassembling from offset 0000, the disassembler would probably show you something like this:

--> 0000   eb 01        jmp short 0003
    0002   90           nop
    0003   cd 20        int 20h

Cool, this looks exactly like our asm source code... Now let's change the program a bit: let's change the data...

    jmp start
    db  0cdh
start:
    int 20h

Now the disassembler will show you this:

--> 0000   eb 01        jmp short 0003
    0002   cd cd        int cdh
    0004   20 ...... whatever...

The problem is that some instructions consist of more than 1 byte, and the debugger doesn't care whether the bytes are meant as code or data. In the above example, if the disassembler serially disassembles bytes from offset 0000 to the end of your program (including your data), then your 1-byte data disassembles into a 2-byte instruction ("stealing" the first byte of your actual code), so the next instruction the debugger tries to disassemble starts at offset 0004 instead of at 0003, where your jmp actually jumps. In the first example we didn't have this problem because the data happened to disassemble into a 1-byte instruction, so after the data part of your program the next instruction to disassemble was at offset 0003, which is exactly the target of your jmp.
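To make the "byte stealing" concrete, here is a minimal sketch of such a dumb, serial disassembler in Python. It is only an illustration: the `LENGTHS` table and the `linear_sweep` function are hypothetical names, and the table knows only the few opcodes that occur in the two example programs, so it reports instruction boundaries rather than full mnemonics.

```python
# Hypothetical, deliberately tiny opcode-length table: it covers only the
# opcodes that appear in the two example programs of this answer.
LENGTHS = {
    0xEB: 2,  # jmp short rel8
    0x90: 1,  # nop
    0xCD: 2,  # int imm8
}

def linear_sweep(code, start=0):
    """Decode consecutive bytes from `start`, never following jumps."""
    result = []
    offset = start
    while offset < len(code):
        # Opcodes missing from the table are treated as 1-byte "whatever".
        length = min(LENGTHS.get(code[offset], 1), len(code) - offset)
        result.append((offset, bytes(code[offset:offset + length])))
        offset += length
    return result

first = bytes([0xEB, 0x01, 0x90, 0xCD, 0x20])   # jmp start / db 90h / int 20h
second = bytes([0xEB, 0x01, 0xCD, 0xCD, 0x20])  # same, but the data byte is cdh

for name, program in (("first", first), ("second", second)):
    print(name)
    for offset, raw in linear_sweep(program):
        print(f"  {offset:04x}  " + " ".join(f"{b:02x}" for b in raw))
```

In the first program the sweep finds instruction starts at 0000, 0002 and 0003; in the second it finds 0000, 0002 and 0004, because the data byte at 0002 swallows the first byte of int 20h.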

Fortunately, what the debugger shows you in this case is not what happens when your program is executed. Executing one instruction makes the program jump to offset 0003, and the debugger then does a dumb disassembly again, this time starting from offset 0003, which is in the middle of an instruction in the previous, incorrect disassembly.

Let's say you debug the second example program and execute all of its instructions one by one. When you start the program with instruction pointer == 0000, the debugger shows this:

--> 0000   eb 01        jmp short 0003
    0002   cd cd        int cdh
    0004   20 ...... whatever...

However, when you trigger the "step" command to execute one instruction, the instruction pointer (IP) changes to 0003 and the debugger performs a dumb disassembly again from offset 0003 until the debugger screen is filled, so you will see this:

--> 0003   cd 20      int 20h
    0005   ...... whatever...

Conclusion: if you use a dumb disassembler and you mix data into the middle of your code (with jmps around the data), then the dumb disassembler will treat your data as code, and this may cause the "minor" issue you've encountered.

An advanced disassembler with flow analysis (like IDA Pro) would disassemble by following the jump instructions. After disassembling your jmp at offset 0000, it would find out that the next instruction to disassemble is the target of the jmp at 0003, and it would disassemble the int 20h as the next step. It would mark the cdh byte at offset 0002 as data.
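The flow-following idea can be sketched in a few lines of Python. This is a simplified illustration and not how IDA Pro is actually implemented: `follow_flow` and `LENGTHS` are hypothetical names, the opcode table covers only the bytes of the second example program, and the sketch assumes that int 20h ends the program (as it does for a DOS .com program).

```python
LENGTHS = {0xEB: 2, 0x90: 1, 0xCD: 2}  # hypothetical mini opcode-length table

def follow_flow(code, entry=0):
    """Disassemble by following control flow; unreached bytes count as data."""
    instructions = {}      # offset -> raw bytes of each reached instruction
    worklist = [entry]
    while worklist:
        offset = worklist.pop()
        if offset in instructions or not 0 <= offset < len(code):
            continue
        opcode = code[offset]
        raw = bytes(code[offset:offset + LENGTHS.get(opcode, 1)])
        instructions[offset] = raw
        if opcode == 0xEB and len(raw) == 2:
            # Unconditional short jump: continue only at the target.
            disp = raw[1] - 0x100 if raw[1] >= 0x80 else raw[1]
            worklist.append(offset + 2 + disp)
        elif opcode == 0xCD and raw[1:] == b"\x20":
            pass  # int 20h terminates the program: this path ends here
        else:
            worklist.append(offset + len(raw))  # ordinary fall-through
    covered = {o + i for o, raw in instructions.items() for i in range(len(raw))}
    data = [o for o in range(len(code)) if o not in covered]
    return instructions, data

program = bytes([0xEB, 0x01, 0xCD, 0xCD, 0x20])  # jmp start / data cdh / int 20h
instructions, data = follow_flow(program)
print("code at:", [f"{o:04x}" for o in sorted(instructions)])
print("data at:", [f"{o:04x}" for o in data])
```

Here the instructions are found at 0000 and 0003, and the byte at 0002 is reported as data. A real flow analyzer also has to handle conditional jumps, calls and indirect jumps, which is where the complexity mentioned earlier comes from.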

Additional explanation:

As you have already noticed, an instruction in the (quite outdated) 8086 instruction set can be anywhere between 1 and 6 bytes long, but a jmp or call can jump anywhere in memory with byte granularity. The length of an instruction can usually be determined from its first 1 or 2 bytes. However, bytes "stick together" into an instruction only when the processor points its IP (instruction pointer) register at the first byte of the instruction and tries to execute the bytes at that offset. Let's see a tricky example: you have the bytes eb ff 26 05 00 03 00 in memory at offset 0000 and you execute them step by step.

--> 0000   eb ff        jmp short 0001
    0002   26 05 00 03  es: add ax, 300h
    0006   00 ...... whatever...

The processor's instruction pointer (IP) points to offset 0000, so the processor decodes an instruction there and the bytes "stick together into an instruction" for the time of execution. Since the first byte is eb, it knows that the instruction is 2 bytes long. The debugger also knows this, so it decodes the instruction for you and then generates some additional bogus disassembly based on the incorrect assumption that at some point the processor will execute an instruction at offset 0002, then at offset 0006, and so on. As you will see, this isn't true: the processor will stick bytes together into instructions at quite different offsets.

As you can see, my tricky byte code contains a jmp that jumps to offset 0001, which is in the middle of the executed jmp instruction itself! This isn't a problem at all, however. The processor doesn't care and happily jumps to offset 0001, so as the next step it will try to decode an instruction (or "stick together bytes") there. Let's see what kind of instruction the processor finds at 0001:

--> 0001   ff 26 05 00  jmp word ptr [5]
    0005   03 00        add ax, word ptr [bx+si]

As you can see, our next instruction is at 0001, and the debugger shows us some garbage disassembly at offset 0005 based on the false assumption that the processor will get to that offset at some point.
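The jump into the middle of the jmp itself follows directly from how a short jump is encoded: the second byte is a signed 8-bit displacement that is added to the address of the following instruction. A small Python sketch of that calculation (the function name `short_jump_target` is just an illustrative choice):

```python
def short_jump_target(offset, rel8):
    """Target of a 2-byte `jmp short` whose first byte is at `offset`."""
    disp = rel8 - 0x100 if rel8 >= 0x80 else rel8  # sign-extend the 8-bit value
    return (offset + 2 + disp) & 0xFFFF            # IP is a 16-bit register

print(f"{short_jump_target(0x0000, 0xFF):04x}")  # eb ff at 0000
print(f"{short_jump_target(0x0000, 0x01):04x}")  # eb 01 at 0000
```

For eb ff the displacement is -1, so the target is 0000 + 2 - 1 = 0001. For the eb 01 of the earlier examples it is 0000 + 2 + 1 = 0003.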

The instruction at 0001 tells the processor to read a word from offset 0005 and use it as the offset to jump to. The bytes at 0005 are 03 00, so the value of word ptr [5] is 3 (as a little-endian 16-bit value), and the processor puts 3 into its IP register (jumps to 0003). Let's see what it finds at offset 0003:

--> 0003   05 00 03     add ax, 300h

It would be difficult to show a disassembly of my tricky byte code eb ff 26 05 00 03 00 in the style of the debugger, because the actual instructions executed by the processor are in overlapping memory areas. First the processor executes bytes 0000-0001, then 0001-0004, and finally 0003-0005.
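The overlapping execution can be reproduced with a tiny Python model of the processor's decode-and-jump behaviour. This is a sketch under stated assumptions: `step` is a hypothetical helper that knows only the three instruction forms occurring in this byte sequence, and the memory read of `jmp word ptr [5]` is taken from the same byte array, i.e. the data segment is assumed to start where the code does.

```python
code = bytes([0xEB, 0xFF, 0x26, 0x05, 0x00, 0x03, 0x00])

def step(code, ip):
    """Decode one instruction at `ip`; return (length, text, next_ip)."""
    opcode = code[ip]
    if opcode == 0xEB:                               # jmp short rel8
        rel8 = code[ip + 1]
        disp = rel8 - 0x100 if rel8 >= 0x80 else rel8
        target = (ip + 2 + disp) & 0xFFFF
        return 2, f"jmp short {target:04x}", target
    if opcode == 0xFF and code[ip + 1] == 0x26:      # jmp word ptr [disp16]
        address = int.from_bytes(code[ip + 2:ip + 4], "little")
        target = int.from_bytes(code[address:address + 2], "little")
        return 4, f"jmp word ptr [{address:x}]", target
    if opcode == 0x05:                               # add ax, imm16
        imm = int.from_bytes(code[ip + 1:ip + 3], "little")
        return 3, f"add ax, {imm:x}h", ip + 3
    raise ValueError(f"opcode {opcode:02x} is not covered by this sketch")

trace = []
ip = 0
for _ in range(3):                                   # the three steps described above
    length, text, next_ip = step(code, ip)
    trace.append((ip, ip + length - 1, text))        # first and last byte executed
    ip = next_ip

for first, last, text in trace:
    print(f"{first:04x}-{last:04x}  {text}")
```

The trace lists the byte ranges 0000-0001, 0001-0004 and 0003-0005, so bytes 0001, 0003 and 0004 each belong to two different executed instructions.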

In some newer RISC architectures the length of instructions is fixed, instructions have to be aligned in memory, and it isn't possible to jump to an arbitrary byte offset, so the job of a debugger is much easier than on x86.
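As a rough illustration of why fixed-length, aligned instructions help, the following Python sketch assumes a hypothetical architecture in which every instruction is exactly 4 bytes long and 4-byte aligned. Under that assumption the instruction boundaries are known without decoding a single byte. This only solves the boundary problem; telling code apart from data embedded in the code still needs flow analysis.

```python
def fixed_width_boundaries(code, width=4):
    """Instruction start offsets when every instruction is `width` bytes long."""
    return list(range(0, len(code) - len(code) % width, width))

def is_valid_jump_target(offset, width=4):
    """A jump can only land on an aligned instruction start."""
    return offset % width == 0

print(fixed_width_boundaries(bytes(16)))                  # 16 bytes of memory
print(is_valid_jump_target(8), is_valid_jump_target(5))
```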
