Why are mov ah,bh and mov al,bl together much faster than the single instruction mov ax,bx?


Question

I found that

mov al, bl
mov ah, bh

is much faster than

mov ax, bx

Can anyone explain why? I'm running on a 3 GHz Core 2 Duo, in 32-bit mode under Windows XP. I assemble with NASM and then link with VS2010. The NASM command:

nasm -f coff -o triangle.o triangle.asm

Here is the main loop I'm using to render a triangle:

; some variables on stack
%define cr  DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]

loop:

add esi, dcg
mov eax, esi
shr eax, 8

add edi, dcb
mov ebx, edi
shr ebx, 16
mov bh, ah

mov eax, cr
add eax, dcr
mov cr, eax

mov ah, bh  ; faster
mov al, bl
;mov ax, bx

mov DWORD [edx], eax

add edx, 4

dec ecx
jge loop

I can provide whole VS project with sources for testing.

Recommended answer

Why is it slow
Using a 16-bit register is expensive compared with an 8-bit register because instructions that use 16-bit registers are decoded in microcode. This costs an extra cycle during decoding, and the instruction cannot be paired while it is decoded.
Also, because ax is a partial register, it takes an extra cycle to execute: the top part of the register has to be combined with the write to the lower part.
8-bit writes have special hardware in place to speed this up, but 16-bit writes do not. Again, on many processors 16-bit instructions take 2 cycles instead of one and do not allow pairing.

This means that instead of being able to process 12 instructions in 4 cycles (3 per cycle), you can now only execute 1, because you get one stall when the instruction is decoded into microcode and another when the microcode is processed.

How can I make it faster?

mov al, bl
mov ah, bh

(This code takes a minimum of 2 CPU cycles and may stall on the second instruction, because on some (older) x86 CPUs you get a lock on EAX.)
Here's what happens:


• EAX is read. (cycle 1)
• The lower byte of EAX is changed. (still cycle 1)
• The full value is written back into EAX. (cycle 1)

On the latest Core2 CPUs this is not so much of a problem, because extra hardware has been put in place that knows that bl and bh never really get in each other's way.

mov eax, ebx

This moves 4 bytes at a time. That single instruction will run in 1 CPU cycle (and can be paired with other instructions in parallel).
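To make the difference between these moves concrete, here is a minimal C sketch that models what each one does to the 32-bit value in EAX. The function names (`write_al`, `write_ah`, `write_ax`, `write_eax`) are made up for illustration; they model register contents only and say nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical models of the register writes; names are illustrative only. */

/* mov al, bl : replace bits 0..7 of eax, keep bits 8..31. */
static uint32_t write_al(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFFFF00u) | (ebx & 0x000000FFu);
}

/* mov ah, bh : replace bits 8..15 of eax, keep the rest. */
static uint32_t write_ah(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFF00FFu) | (ebx & 0x0000FF00u);
}

/* mov ax, bx : replace bits 0..15 of eax, keep bits 16..31. */
static uint32_t write_ax(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFF0000u) | (ebx & 0x0000FFFFu);
}

/* mov eax, ebx : the old value of eax is not needed at all. */
static uint32_t write_eax(uint32_t eax, uint32_t ebx) {
    (void)eax;
    return ebx;
}

/* The two byte moves together produce the same value as the one word move. */
static void check_byte_moves_match_word_move(uint32_t eax, uint32_t ebx) {
    assert(write_ah(write_al(eax, ebx), ebx) == write_ax(eax, ebx));
}
```

Every partial write has to read the old EAX to produce the new one, which is the merge described above. Only the full 32-bit move ignores the old value.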


• If you want fast code, always use the 32-bit (EAX, EBX etc) registers.
• Try to avoid using the 8-bit sub-registers, unless you have to.
• Never use the 16-bit registers. Even if you have to use 5 instructions in 32-bit mode, that will still be faster.
• Use the movzx reg, ... (or movsx reg, ...) instructions.
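As a rough illustration of the last point, this C sketch models what movzx and movsx compute when loading an 8-bit value into a 32-bit register. The helper names `movzx_u8` and `movsx_i8` are invented for this example. Both results depend only on the source byte, so no old register contents have to be merged in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "movzx r32, r/m8": zero-extend the low byte. */
static uint32_t movzx_u8(uint32_t src) {
    return src & 0xFFu;
}

/* Hypothetical model of "movsx r32, r/m8": sign-extend the low byte.
   The xor/subtract form keeps everything in unsigned arithmetic. */
static uint32_t movsx_i8(uint32_t src) {
    uint32_t low = src & 0xFFu;
    return (low ^ 0x80u) - 0x80u;
}

/* Small self-check: a byte with the top bit clear extends the same way
   under both instructions. */
static void check_positive_byte(uint32_t src) {
    assert((src & 0x80u) != 0 || movzx_u8(src) == movsx_i8(src));
}
```

Neither function takes the previous destination value as input, in contrast to a write to AL, AH or AX.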

Speeding up the code
I see a few opportunities to speed up the code.

; some variables on stack
%define cr  DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]

mov edx, cr          ; keep the red accumulator in a register instead of memory

loop:

add esi, dcg
mov eax, esi
shr eax, 8

add edi, dcb
mov ebx, edi
shr ebx, 16          ; upper 16 bits of ebx are now zero
mov bh, ah

;mov eax, cr
;add eax, dcr
;mov cr, eax

add edx, dcr
mov eax, edx

and eax, 0xFFFF0000  ; clear the lower 16 bits of eax
or eax, ebx          ; merge the two
;mov ah, bh  ; faster
;mov al, bl

mov DWORD [ebp+offset+ecx*4], eax ; offset is a placeholder; requires storing the data in reverse order
;add edx, 4

sub ecx, 1  ; dec ecx does not change the carry flag, which can cause
            ; a false dependency on previous instructions which do change CF
jge loop
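To check that the and/or merge yields the same pixel as the original pair of byte moves, here is a small C sketch of the per-pixel packing. It assumes the three accumulators are fixed-point values whose useful 8 bits sit in bits 16..23, as the shifts in the loop suggest. The function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the rewritten loop body (and/or merge). */
static uint32_t pack_pixel_merge(uint32_t r_acc, uint32_t g_acc, uint32_t b_acc) {
    uint32_t eax = g_acc >> 8;                        /* shr eax, 8 */
    uint32_t ebx = b_acc >> 16;                       /* shr ebx, 16 (upper 16 bits zero) */
    ebx = (ebx & 0xFFFF00FFu) | (eax & 0x0000FF00u);  /* mov bh, ah */
    eax = r_acc & 0xFFFF0000u;                        /* and eax, 0xFFFF0000 */
    return eax | ebx;                                 /* or eax, ebx */
}

/* Hypothetical model of the original loop body (two byte moves). */
static uint32_t pack_pixel_bytes(uint32_t r_acc, uint32_t g_acc, uint32_t b_acc) {
    uint32_t eax = g_acc >> 8;
    uint32_t ebx = b_acc >> 16;
    ebx = (ebx & 0xFFFF00FFu) | (eax & 0x0000FF00u);  /* mov bh, ah */
    eax = r_acc;                                      /* mov eax, cr (after the add) */
    eax = (eax & 0xFFFF00FFu) | (ebx & 0x0000FF00u);  /* mov ah, bh */
    eax = (eax & 0xFFFFFF00u) | (ebx & 0x000000FFu);  /* mov al, bl */
    return eax;
}

/* Both variants should agree for any accumulator values. */
static void check_same_pixel(uint32_t r, uint32_t g, uint32_t b) {
    assert(pack_pixel_merge(r, g, b) == pack_pixel_bytes(r, g, b));
}
```

The two versions agree because shr ebx, 16 leaves the upper half of ebx at zero, so or eax, ebx can only touch the lower 16 bits.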
      
