If I have an 8-bit value, is there any advantage to using an 8-bit register instead of say, 16, 32, or 64-bit?


Question

The introductory x86 asm literature I read just seems to stick with 32-bit registers (eax, ebx, etc) in all practical scenarios except to demonstrate the 64-bit registers as a thing that also exists. If 16-bit registers are mentioned at all, it is as a historical note explaining why the 32-bit registers have an 'e' in front of their names. Compilers seem equally disinterested in less-than-32-bit registers.

Consider the following C code:

int main(void) { return 511; }

Although main purports to return an int, in fact Linux exit status codes are 8-bit, meaning any value over 255 will be reduced to its least significant 8 bits, viz.

hc027@HC027:~$ echo "int main(void) { return 511; }" > exit_gcc.c
hc027@HC027:~$ gcc exit_gcc.c 
hc027@HC027:~$ ./a.out 
hc027@HC027:~$ echo $?
255

So we see that only the lowest 8 bits of int main(void)'s return value will be used by the system. Yet when we ask GCC for the assembly output of that same program (e.g. via gcc -S), will it store the return value in an 8-bit register? Let's find out!

hc027@HC027:~$ cat exit_gcc.s
    .file   "exit_gcc.c"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $511, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

Nope! It uses %eax, a very-much-32-bit register! Now, GCC is smarter than me, and maybe the return value of int main(void) gets used for other things I don't know about, where it won't be truncated to the 8 least significant bits (or maybe the C standard decrees that it must return a for-realsies, actual int no matter what its actual destiny).

But regardless of the efficacy of my specific example, the question stands. As far as I can tell, the registers under 32-bits are pretty much neglected by modern x86 assembly programmers and compilers alike. A cursory Google of "when to use 16-bit registers x86" returns no relevant answers. I'm pretty curious: is there any advantage to using the 8 and 16-bit registers in x86 CPUs?

Answer

So, it doesn't really have to be that way; there's a bit of history going on here. Try running

    mov rax, -1 # 0xFFFFFFFFFFFFFFFF
    mov eax, 0
    print rax

On your favorite x86-64 desktop (print being whatever your environment/language provides). What you'll notice is that even though rax started out as all ones, and you might think you only cleared the bottom 32 bits, the print statement prints zero! A write to eax wipes all of rax: the upper 32 bits are zeroed and the lower 32 take the new value. Why? That's awfully weird and unintuitive behavior. The reason is simple: because it's much faster. Trying to preserve the upper half of rax would be an absolute pain when you keep writing to eax.
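
For a version of that snippet you can actually compile and run, here is a minimal sketch using GCC extended inline asm; it assumes an x86-64 GCC or Clang toolchain, and the file name and surrounding C scaffolding are just illustrative:

/* zeroext.c - build and run with: gcc -O2 zeroext.c && ./a.out        */
/* Shows that a write to eax zero-extends into rax.                    */
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    uint64_t out;
    __asm__ volatile (
        "movq $-1, %%rax\n\t"   /* rax = 0xFFFFFFFFFFFFFFFF               */
        "movl $0,  %%eax\n\t"   /* the write to eax clears the upper bits */
        "movq %%rax, %0"        /* copy rax out so C can print it         */
        : "=r"(out)
        :
        : "rax");
    printf("rax = 0x%016" PRIx64 "\n", out);   /* prints 0x0000000000000000 */
    return 0;
}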

Intel/AMD however, didn't realize this back when they originally decided to move to 32-bit, and made a fatal error that forever left al/ah as little more than a historical relic: when you write to al or ah, the rest of the register doesn't get clobbered! That does make more intuitive sense, and it was a great idea back in the 16-bit era, because you effectively get twice as many small registers (ah and al alongside ax), and later a full 32-bit eax on top. But nowadays, with the move to an abundance of registers, we just don't need more register names anymore. What we really want are faster registers, and to push more GHz. From that point of view, every time you write to al or ah, the processor needs to preserve the rest of the register, which is fundamentally much more expensive. (Explanation of why comes later.)
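
For contrast, here is the counterpart sketch (same toolchain assumptions as above): a write to al leaves the upper bytes of rax in place, which is exactly the merging behavior being described.

/* partial.c - build and run with: gcc -O2 partial.c && ./a.out        */
/* A write to al (or ah) preserves the rest of rax, so the CPU has to  */
/* merge the new byte into the old register value.                     */
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    uint64_t out;
    __asm__ volatile (
        "movq $-1, %%rax\n\t"   /* rax = 0xFFFFFFFFFFFFFFFF        */
        "movb $0,  %%al\n\t"    /* only the low byte changes       */
        "movq %%rax, %0"
        : "=r"(out)
        :
        : "rax");
    printf("rax = 0x%016" PRIx64 "\n", out);   /* prints 0xffffffffffffff00 */
    return 0;
}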

Enough with the theory, let's get some real tests. Each test case was run three times. These tests were run on an Intel Core i5-4278U CPU @ 2.60GHz.

Only rax: 1.067s, 1.072s, 1.097s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov rax, 5
mov rax, 5
mov rax, 6
mov rax, 6
mov rax, 7
mov rax, 7
mov rax, 8
mov rax, 8
dec ecx
jmp loop
exit:
ret

Only eax: 1.072s, 1.062s, 1.060s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov eax, 5
mov eax, 5
mov eax, 6
mov eax, 6
mov eax, 7
mov eax, 7
mov eax, 8
mov eax, 8
dec ecx
jmp loop
exit:
ret

Only ah: 2.702s, 2.748s, 2.704s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov ah, 5
mov ah, 6
mov ah, 6
mov ah, 7
mov ah, 7
mov ah, 8
mov ah, 8
dec ecx
jmp loop
exit:
ret

Only ah/al: 1.432s, 1.457s, 1.427s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
mov ah, 6
mov al, 6
mov ah, 7
mov al, 7
mov ah, 8
mov al, 8
dec ecx
jmp loop
exit:
ret

ah and al, then eax: 1.117s, 1.084s, 1.082s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
mov eax, 6
mov al, 6
mov ah, 7
mov eax, 7
mov ah, 8
mov al, 8
dec ecx
jmp loop
exit:
ret

(Note that these tests have nothing to do with partial-register stalls, as I'm not reading eax after the writes to ah. This is in reference to the comments on the main post.)

As you can see from the tests, using al/ah is much slower. Using eax/rax blows the other times out of the water, and there is fundamentally no performance difference between rax and eax themselves. As discussed, the reason is that writes to eax/rax directly overwrite the entire register, whereas using ah or al means the rest of the register needs to be maintained.

Now, if you wish, we can delve into the explanation of why it's more efficient to just wipe the register on every use. At face value it doesn't seem like it should matter: just update the bits that matter, right? What's the big deal?

Well, modern CPUs are intelligent: they will very aggressively parallelize operations that the CPU knows can't interfere with each other, but only when such parallelization is actually possible. For example, if you mov eax to ebx, then ebx to ecx, then ecx to edx, the CPU cannot parallelize that chain, and it will run slower than usual. However, if you write to eax, write to ebx, write to ecx, and write to edx, then the CPU can parallelize all of those operations, and it will run much faster than usual! Feel free to test this on your own.
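
If you'd like to try that yourself, here is one possible sketch in C with inline asm, comparing a dependent mov chain against independent writes. It assumes an x86-64 GCC/Clang toolchain on a POSIX system; the file name and iteration count are just illustrative, absolute times will vary by CPU, and very recent cores can eliminate register-to-register movs at rename (swapping the movs for adds makes the dependency unavoidable).

/* chains.c - build and run with: gcc -O2 chains.c -o chains && ./chains */
#include <stdio.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long n = 1000000000L;

    double t0 = now();
    for (long i = 0; i < n; i++) {
        /* dependent chain: each mov reads the result of the one before it */
        __asm__ volatile (
            "mov %%eax, %%ebx\n\t"
            "mov %%ebx, %%ecx\n\t"
            "mov %%ecx, %%edx"
            ::: "eax", "ebx", "ecx", "edx");
    }
    double t1 = now();

    for (long i = 0; i < n; i++) {
        /* independent writes: nothing reads another instruction's result */
        __asm__ volatile (
            "mov $1, %%eax\n\t"
            "mov $2, %%ebx\n\t"
            "mov $3, %%ecx\n\t"
            "mov $4, %%edx"
            ::: "eax", "ebx", "ecx", "edx");
    }
    double t2 = now();

    printf("dependent chain:    %.3fs\n", t1 - t0);
    printf("independent writes: %.3fs\n", t2 - t1);
    return 0;
}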

Internally, the way this is implemented is by immediately starting to execute and calculate an instruction, even if earlier instructions are still in the midst of being executed. However, the primary restriction is the following:

  • If an earlier instruction writes to some register A, and the current instruction reads from that same register A, then the current instruction must wait until the earlier instruction has been completed in its entirety, which is what causes these kinds of slowdowns.

In our mov eax, 5 spam test, which took ~1 second, the CPU could aggressively run all of the operations in parallel, because none of the instructions read from anything anyway; they were all write-only. It only needs to ensure that the most recent write is the value the register holds during any future reads (which is easy, because even though the operations all occur in overlapping time periods, the one that started last will also finish last).

In the mov ah, 5 spam test, it was a painful ~2.5x slower than the mov eax, 5 spam test (2.7s vs ~1.07s), because there's fundamentally no easy way to parallelize the operations. Each operation is marked as "reading from eax", since it depends on the previous value of eax, and it's also marked as "writing to eax", because it modifies the value of eax. If an operation must read from eax, it must occur after the previous operation has finished writing to eax. Thus, parallelization suffers dramatically.

Also, if you want to try it on your own, you'll notice that add eax, 5 spamming and add ah, 5 spamming both take exactly the same amount of time (2.7s on my CPU, exactly the same as mov ah, 5!). In this case, add eax, 5 is marked as "read from eax" and as "write to eax", so it receives exactly the same slowdown as mov ah, 5, which must also both read and write eax! The actual mov vs add doesn't matter; the logic gates will immediately connect the input to the output via the desired operation in a single tick of the ALU.

So, I hope that shows why eax's full-register overwrite behavior leads to times that are faster than ah's preservation scheme.

There are a couple more details here, though: why did the ah/al swap test take a much faster 1.43 seconds? Well, most likely what's happening is that register renaming is helping with all of the "mov ah, 5; mov al, 5" writes. It looks like the CPU is intelligent enough to split "ah" and "al" into their own full 64-bit physical registers, since they use different parts of the "eax" register anyway. That allows each consecutive ah-then-al pair of operations to run in parallel, saving significant time. If "eax" is ever read in its entirety, the CPU needs to coalesce the "al" and "ah" registers back into one register, causing a significant slowdown (shown later). In the earlier "mov ah, 5"-only test, it wasn't possible to split eax into separate registers, because we used "ah" every single time anyway.

And, interestingly, if you look at the ah/al/eax test, you can see that it was almost as fast as the eax test! In this case, I'm predicting that all three got their own registers and the code was thus extremely parallelized.

Of course, as mentioned, attempting to read eax anywhere in that loop is going to kill performance, since ah/al then have to be coalesced; here's an example:

Times: 3.412s, 3.390s, 3.515s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
xor eax, 5
mov al, 6
mov ah, 8
xor eax, 5
mov al, 8
dec ecx
jmp loop
exit:
ret

But, note that the above test doesn't have a proper control group, as it uses xor instead of mov (e.g., what if just using "xor" is the reason it's slow?). So, here's a test to compare it to:

Times: 1.426s, 1.424s, 1.392s

global _main
_main:
mov ecx, 1000000000
loop:
test ecx, ecx
jz exit
mov ah, 5
mov al, 5
xor ah, 5
mov al, 6
mov ah, 8
xor ah, 5
mov al, 8
dec ecx
jmp loop
exit:
ret

The above test coalesces very aggressively, which causes the horrible 3.4 seconds, in fact far slower than any of the other tests. But the al/ah test splits al/ah into two different registers and thus runs pretty fast, faster than only using ah, because consecutive ah/al operations can be parallelized. So, that was the trade-off Intel was willing to make.

As mentioned, and as seen, it just doesn't really matter whether you use xor vs add vs mov: the ah/al version above still takes 1.4 seconds. Bitwise ops, add, and mov all simply hook the input up to the output with very few logic gates, so it doesn't matter which operation you use (however, mul and div will indeed be slower, since they require tougher computation and thus several micro-cycles).
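
If you want to see that last point for yourself, here is a rough sketch comparing a dependent chain of adds with a dependent chain of imuls (same toolchain assumptions as the earlier sketches; exact latencies vary by CPU, and div would be slower still):

/* latency.c - build and run with: gcc -O2 latency.c -o latency && ./latency */
/* A loop-carried add chain (about 1 cycle of latency per step) vs a         */
/* loop-carried imul chain (typically ~3 cycles of latency per step).        */
#include <stdio.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long n = 1000000000L;
    long x = 1;

    double t0 = now();
    for (long i = 0; i < n; i++)
        __asm__ volatile ("add $1, %0" : "+r"(x));       /* each add depends on the previous add   */
    double t1 = now();

    for (long i = 0; i < n; i++)
        __asm__ volatile ("imul $3, %0, %0" : "+r"(x));  /* each imul depends on the previous imul */
    double t2 = now();

    printf("add chain:  %.3fs\n", t1 - t0);
    printf("imul chain: %.3fs\n", t2 - t1);
    return 0;
}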

Those last two xor tests show the reported partial-register stall, which to be honest I hadn't even considered at first. I initially thought register renaming would mitigate the problem, which it appears to do in the ah/al mixes and the ah/al/eax mixes. However, reads of eax with dirty ah/al values are brutal, because the processor now has to combine the ah/al registers. It looks like processor manufacturers believed renaming partial registers was still worth it, which makes sense, since most work with ah/al doesn't involve reads of the full eax; you would just read from ah/al if that was your plan. This way, tight loops that bit-fiddle with ah/al benefit greatly, and the only harm is a hiccup on the next use of eax (at which point ah/al are probably not going to be used anymore).

If Intel had wanted, rather than the ah/al register-renaming optimization giving 1.4 seconds, plain ah taking 2.7 seconds, and register-coalescing abuse taking 3.4 seconds, they could have skipped renaming partial registers entirely, and all of those tests would have taken the exact same 2.7 seconds. But Intel is smart: they know there's code out there that wants to use ah and al a lot, while it's not common to find code that uses al and ah heavily and also reads the full eax all the time.

Overall, even in the case of no partial register stall, writes to ah are still much slower than writes to eax, which is what I was trying to get across.

Of course, results may vary. Other processors (most likely very old ones) might have control bits to shut off half of the bus, which would let it act like a 16-bit or 8-bit bus when needed. Those control bits would have to be wired in via logic gates along the input to the registers, which would slightly slow down any and all use of the register, since that's now one more gate to go through before the register can update. Since such control bits would be off the vast majority of the time (it's rare to mess with 8-bit/16-bit values), it looks like Intel decided not to do that (for good reason).
