增加二进制的效率->8086的格雷码 [英] Increasing Efficiency of binary -> gray code for 8086

查看:74
本文介绍了增加二进制的效率->8086的格雷码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是汇编的初学者,这是我设计的代码,用于从二进制转换为灰色,并以十六进制打印所得的位模式.

I am a beginner in assembly and This is a code that I designed to convert from binary and to gray, and prints the resulting bit-pattern in hex.

      mov      al, a               
      mov      bl, al
      shr      bl, 1
      xor      al, bl                         

尽管程序正在运行,但我想学习其他更简单的方法来提高效率,但我尝试了许多其他方法,但它会影响输出.

Although the program is working, I want to learn about other simpler methods to increase the efficiency, I tried many other ways but it is affecting the output.

推荐答案

(此答案是根据问题的第一个版本编写的,该版本中存在一些尚未优化的代码,并且是的完整源代码.exe程序.对该问题的更新已删除了仅有空间需要优化的部分,除了与8086无关的ILP之外,因此我无意删除答案中的那些部分.)

(This answer was written based on the first version of the question, where there was some not-already-optimal code, and it was complete source for a .exe program. The updates to the question have removed the only parts that had room to be optimized except for ILP which is irrelevant on 8086, so I don't intend to remove those parts of the answer.)

代码大小优化(与8086尤其是8088的速度相关).

Code size optimizations (which correlate with speed on 8086 and especially 8088).

bin2Gray的部分:没有变化,除非您计数从mem重新加载或使 a 不变.
字节->十六进制数字:21字节,低于32 ( code fetch 的字节减少了)
退出:从4(mov ah/int)下移1个字节( ret ).至少适用于 .com 可执行文件,该文件也较小.

bin2Gray part: no change, unless you count reloading from mem or making a constant.
byte->hex digits: 21 bytes, down from 32 (and fewer bytes of code fetch)
exit: 1 byte (ret) down from 4 (mov ah / int). Works for at least .com executables, which are also smaller.

我可能应该在需要获取的总字节数(即执行的指令字节)中计算代码大小,而不是静态代码大小,尽管这对于优化也很有用.从21字节的bin2hex部分中删除循环将花费几个字节的静态代码大小,但会将动态字节数减少约5个IIRC.并避免在采取的分支上丢弃任何预取缓冲区,当然,除了 int 10h .

I probably should have counted code size in total bytes that need to get fetched (i.e. bytes of instructions executed), rather than static code size, although that's also useful to optimize for. Dropping the looping from the 21-byte bin2hex part would cost a couple bytes of static code size, but reduce dynamic byte count by about 5, IIRC. And avoid any prefetch buffer discarding on taken branch, except for the int 10h of course.

int 20h 也可以退出(不需要任何AH设置),但是只能从 .com 可执行文件退出.对我而言,有趣的部分是使用紧凑的代码来计算所需寄存器中的ASCII数字,但是如果您想要一个很小的整体程序,则可以使用 .com .这也避免了DS的设置.(尽管使 a 和EQU或 = 不变,则无需设置DS.)

int 20h can also exit (without needing any AH setup), but only from a .com executable. The interesting part to me is computing the ASCII digits in the desired register with compact code, but if you want a tiny overall program, a .com is the way to go. That also avoids setup of DS. (Although you wouldn't need to set up DS if you make a and EQU or = constant.)

未尝试:利用初始寄存器值,该值在某些DOS版本中显然是可靠的.如果您假装自己正在编写一个可能在较大程序中有用的块,那将是不可行的.

Not attempted: taking advantage of initial register values, which apparently are reliable across some DOS versions. If you're pretending like you're writing a block that might be useful as part of a larger program, that's not viable.

您的程序基本上分为两个部分:计算格雷码,并将bin-> hex用于寄存器中的1字节值.单独提取半字节并不能有效地向后优化格雷码计算.

Your program basically has two separate parts; calculating a Gray code, and bin->hex for a 1-byte value in a register. Extracting the nibbles separately doesn't usefully optimize backwards into the Gray-code calculation.

我认为从二进制到格雷代码不可能进行任何真正的优化.如果您查看整个程序,我们可以查看mov/shr/xor的代码 around .在实际的用例中,通常需要将数据存储在寄存器中的代码,因为这是将数据传递给函数的方式.

I don't think any real optimizations are possible in your binary to Gray code. If you look at the whole program, we can look at the code around the mov/shr/xor. In real use-cases, you'd normally want code that took data in a register, because that's how you should pass data to functions.

但是,如果它是内存中的全局变量,则可以通过加载两次来节省一个代码大小的字节,而不用用 mov 复制寄存器,但这样做的代价是速度.甚至更好,使其成为一个汇编时间常数.(尽管这样做,下一步是 mov al,a ^(a>> 1)在组装时进行计算.)

But if it was a global in memory, you could save a byte of code size by loading it twice instead of copying a register with mov, at the expense of speed. Or even better, make it an assemble-time constant. (Although if you do that, the next step is mov al, a ^ (a>>1) to do the calculation at assemble time.)

; a equ 0ACh         ; makes the following instructions 2 bytes each
;;; otherwise, with a being static storage, loading from memory twice sucks
    mov  al, a
    shr  al, 1    ; 2B, 2 cycles
    xor  al, a    ; reg,imm: 2B, 4 cycles on 8088.    reg,mem: 3B, 13+6 cycles

来自 https://www2的时间.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html .我不确定它的编号是否完全取自指令之后的代码,或者代码获取瓶颈是否分开.8088具有4字节的预取缓冲区(只有8字节的数据总线),在8086上具有16位总线的6字节.为简单起见,我只是使用表中的数字,而没有尝试对预取缓冲区的行为进行建模.

Timings from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html. I'm not sure if its numbers are from after the instruction is fully fetched, or if code-fetch bottlenecks are separate. 8088 has a 4-byte prefetch buffer (and only an 8-byte data bus), 6-byte on 8086 with its 16-bit bus. For simplicity, I just used numbers from the table without trying to model how the prefetch buffer would behave.

代码获取是8088的主要瓶颈,在某种程度上也是8086的瓶颈,所以我认为使用 xchg ax,dx (1个字节)代替 mov ax值得,dx (2个字节),如果您不在乎该值是否仍在DX中.即使该表报告MOV花费了2个周期,但XCHG-with-AX花费了3c.

Code fetch is a major bottleneck on 8088, and to some degree on 8086, so I think it can be worth using xchg ax, dx (1 byte) instead of mov ax, dx (2 bytes) if you don't care about the value still being in DX. Even though that table reports that MOV takes 2 cycles, but XCHG-with-AX takes 3c.

有一项优化将对可以并行运行多条指令的更多现代CPU(以Pentium开头)有所帮助:复制寄存器,然后转移原始,这样可以在同一时间发生循环作为副本.

There's one optimization that will help on more modern CPUs (starting with Pentium) that can run more than 1 instruction in parallel: copy the register, then shift the original so that can happen in the same cycle as the copy.

; optimized for Instruction-level Parallelism
;; input: AL    output: AL = bin_to_gray(AL)
;; clobbers: DL
   mov  dl, al
   shr  al, 1
   xor  al, dl

(有关现代CPU的更多信息,请参见 https://agner.org/optimize/.并且还 x86的MOV是否真的可以"free"?为什么我根本不能重现此内容?-mov-elimination在字节或字寄存器中不起作用,因为它已合并到EDX的低端部分,所以即使在具有mov-elimination的CPU上也是如此通常,它无法在此处应用,因此此优化可节省延迟.)

(For more about modern CPUs, see https://agner.org/optimize/. And also Can x86's MOV really be "free"? Why can't I reproduce this at all? - mov-elimination doesn't work on byte or word registers because that's merging into the low part of EDX. So even on CPUs with mov-elimination in general, it can't apply here so this optimization saves latency.)

我很确定bin中没有进一步的改进空间->灰色.即使是现代的x86也没有复制和右移或右移和异或,因此也无法避免使用 mov ,并且shr和xor显然也是必需的.XOR是无附加值的,但是我认为这没有帮助.除非您具有无进位乘法( pclmulqdq )和一个乘法器常数,以使输入的两个副本以彼此正确的偏移量进入乘法结果的上半部分,否则您将需要做分别进行这些操作.

I'm pretty sure there's no further room for improvement in bin -> gray. Even modern x86 doesn't have a copy-and-right-shift, or right-shift-and-xor, so there's no avoiding a mov, and the shr and xor are obviously also necessary. XOR is add-without-carry, but I don't think that helps. Unless you had carryless multiply (pclmulqdq) and a multiplier constant to get two copies of the input at the right offsets from each other into the high half of a multiply result, you're going to need to do those operations separately.

不过,如果您要详尽检查,请 https://en.wikipedia.org/wiki/超优化-要求超优化器查找与mov/shr/xor序列产生相同AL结果的序列.

Still, if you want to exhaustively check, https://en.wikipedia.org/wiki/Superoptimization - ask a superoptimizer to look for sequences that produce the same AL result as the mov/shr/xor sequence.

这是更有趣的部分.

仅进行2次迭代有时是不值得的,尤其是当您将每一部分分开进行时可以节省一些工作时.(例如,低半字节为 x& 0xf ,高半字节为 x>>> 4 .使用rol/mov/并非最佳选择.)

Looping for only 2 iterations is sometimes not worth it, especially when you can save some work if you do separate things to each half. (e.g. low nibble is x & 0xf, high is x >> 4. Using rol/mov/and is not optimal.)

技巧:

  • 更喜欢 insn al,imm -x86具有

  • Prefer insn al, imm - x86 has short-form special cases for immediate operands with AL. (Also AX,imm16).

想在AL中做一些事情意味着使用 BIOS可以更高效地打印 int 10h /AH = 0Eh 电传输出,它将输入输入AL,并且不会破坏其他任何寄存器.我认为BIOS输出将忽略DOS I/O重定向,例如 foo>outfile.txt ,并始终打印到屏幕上.

Wanting to do stuff in AL means it's more efficient to print with BIOS int 10h / AH=0Eh teletype output which takes its input in AL, and doesn't destroy any other registers. I think BIOS output will ignore DOS I/O redirection like foo > outfile.txt and always print to the screen.

有一个邪恶的黑客滥用 DAS 将0..15整数转换为ASCII十六进制数字'0'..'9''A'..'F'无分支.在8086(不同于现代的x86)上,DAS的速度与典型的整数指令一样快.有关以下内容的细分,请参见此codegolf.SE答案.确切的原因;这是非常不明显的,但是它避免分支,因此实际上在8086上有很大的提速.

There's an evil hack that abuses DAS to turn a 0..15 integer into an ASCII hex digit '0'..'9' or 'A'..'F' without branching. On 8086 (unlike modern x86) DAS is just as fast as typical integer instruction. See this codegolf.SE answer for a breakdown on exactly why it works; it's highly non-obvious, but it avoid branching so it's actually a big speedup on 8086.

BIOS/DOS调用通常不会修改AH,因此可以在循环外进行设置.

BIOS / DOS calls generally don't modify AH, so setting it can be done outside the loop.

然后获取代码大小,而不仅仅是展开,而是使用 cl = 4 作为循环计数器来循环并重新运行 some 较早的代码一次(不包括移位). sub cl,2 / jnz 可以工作,但是使用奇偶校验标志是使用 dec cx (1B)/ jpe的一种方式向后跳一次,然后跌落到下一次.

Then for code-size, instead of just unrolling, use the cl=4 as a loop counter to loop back and re-run some of the earlier code once (not including the shift). sub cl, 2 / jnz would work, but using the parity flag is a way to use dec cx (1B) / jpe to jump backwards once, then fall through the next time.

DOS程序(或至少 .com 程序)的SP指向干净退出的某些代码的地址.因此,您可以通过 ret 退出.

DOS programs (or at least .com programs) have SP pointing at the address of some code that exits cleanly. So you can exit via ret.

(我并没有考虑在保持整体策略的同时改善循环.使用 AL 尽可能多的指令是值得的,但是运行 rol 两次而不是一次移位在8086上花费了很多周期:按CL移位为8 + 4 * n.)

(I didn't look at improving your loop while keeping the overall strategy. Using AL for as many instructions as possible is prob. worth it, but running rol twice instead of shifting once costs a lot of cycles on 8086: 8 + 4*n for shift-by-CL.)

;; input: byte in AL.   output: print 2 ASCII hex digits with BIOS int 10h
;; clobbers: CX, DX
hexprint_byte:
     mov   ah, 0Eh                       ; BIOS teletype call #
;     push  ax            ; 1B 15c
     mov   dx, ax        ; 2B 2c         ; save number, and AH=call number
     mov   cl, 4         ; 2B 4c
     shr   al, cl        ; 2B 8+4*4 cycles   isolate the high nibble
.loop:
     cmp   al, 10        ; 2B 4c  set CF according to digit <= 9
     sbb   al, 69h       ; 2B 4c  read CF, set CF and conditionally set AF
     das                 ; 1B 4c   magic, which happens to work
     int   10h           ; 2B        BIOS teletype output (AL), no return value
;     pop   ax            ; 1B 12c    ; would do one extra pop if you used this instead of mov/xchg, so you'd need jmp ax instead of ret.  But AND destroys AX
     xchg  ax, dx        ; 1B 3c     ; retrieve the original again (with AH=0Eh call number)
     and   al, 0Fh       ; 2B 4c     ; isolate the low nibble this time

     dec   cx            ; 1B 3c     ; PF is set from the low byte only, CH garbage isn't a problem.  
     jpe   .loop         ; 2B 4c-not-taken, 16c-taken
               ;  4-1 = 3, (0b11) which has even parity
               ; so JPE is taken the first time, falls through the 2nd 

;; size = 21 bytes

然后您可以使用 ret int 20h 退出程序.

Then you can exit the program with ret, or with int 20h.

这是NASM语法;如果您的汇编器不喜欢 .loop ,则将其更改为其他名称.(NASM不允许 2:作为本地标签,因此无论如何我都必须选择其他名称.)我在Linux上测试了此单步操作,以确保循环分支被采用一次,并且我达到 int 10h 时,AH/AL中的值正确.(我用NOP替换了它,因为我实际上将其内置到32位静态可执行文件中,因此我可以轻松地在GDB中将其单步执行,而不会搞砸过时的16位开发者设置.字节数来自于16-位,当然.)

This is NASM syntax; if your assembler doesn't like .loop then change it to something else. (NASM doesn't allow 2: as a local label so I had to pick different names anyway.) I tested this single-stepping on Linux to make sure the loop branch was taken once, and that I got the right values in AH/AL when int 10h was reached. (I replaced it with a NOP since I actually built this into a 32-bit static executable so I could easily single-step it in GDB, without messing around with an obsolete 16-bit dev setup. Byte counts are from assembling as 16-bit, of course.)

为了提高速度,只需复制cmp/sbb/das/int 10h即可节省几个字节,并保存 dec / jpe .(对于dec/jpe,像7个字节,而不是3个字节).无论哪种方式,第一次打印后的xchg/AND都是必需的.

For speed, it would cost only a few more bytes to just duplicate the cmp/sbb/das/int 10h, saving the dec/jpe. (Like 7 bytes instead of 3 for the dec/jpe). The xchg / AND after the first print are necessary either way.

标记分支花费16个周期,这将避免第二次冗余/无用地执行 xchg /(3个字节/7个周期)以及循环开销

Taken branches cost 16 cycles, and that would avoid a 2nd redundant/useless execution of xchg / and (3 bytes / 7 cycles), and of the loop overhead.

您要的东西很小(在8086上速度很快),所以我就是这么做的.为了节省字节数,这牺牲了其他所有方面,包括可读性在内.但这就是汇编中的代码查询的乐趣!

You asked for small (and fast on 8086), so that's what I did. This sacrifices everything else, very much including readability, to save bytes. But that's the fun of code-golfing in assembly!

不幸的是,它也肯定不是更简单,就像您在标题中所要求的那样.更简单的方法可能是使用查找表,也许与xlatb一起使用.在8086上,这可能也会更快,特别是如果您想避免 DAS 黑客攻击.

Unfortunately it's also definitely not simpler, like you asked for in the title. Simpler might use a lookup table, perhaps with xlatb. That might also be faster on 8086, especially if you want to avoid the DAS hack.

可能有助于代码大小(但对性能非常不利)的另一个技巧是 aam 16 来设置AH =商=前导数字,AL =余数=尾随数字(低的).(请注意,与 div bl 相反)在装配中显示时间显示将其与BIOS int 10h输出一起使用的两位数字 decimal 数字的示例.(通常,AAM与立即数 10 一起使用,显然NEC V20 CPU忽略立即数,并且总是除以10.IntelCPU只是对AL进行立即数划分).在8088/8086上,AAM需要83个周期,类似于 div ,这基本上就是它的工作.使用2的幂进行硬件除法通常是可怕的.

Another trick that might help for code size (but very bad for performance) is aam 16 to set AH= quotient = leading digit, AL = remainder = trailing digit (low). (Note that's opposite of div bl) Displaying Time in Assembly shows an example of using it with BIOS int 10h output for a 2-digit decimal number. (Normally AAM is used with an immediate 10, and apparently NEC V20 CPUs ignore the immediate and always divide by 10. Intel CPUs just do immediate division of AL). On 8088/8086, AAM takes 83 cycles, similar to div, which is basically what it does. Using HW division with a power of 2 is generally horrible.

使用AAM 16的版本以23个字节出现,使用任何循环(我在寄存器中没有要利用的常量,因此 mov cx,1 / loop 为5个字节,而cmp/sbb/das/int 10h总计为7)

A version using AAM 16 came in at 23 bytes, not using any looping (I didn't have any constants in registers to exploit, so mov cx, 1 / loop would be 5 bytes, while cmp/sbb/das/int 10h is 7 total)

     aam   16       ; 83 cycles, 2 bytes   AH= quotient = leading digit   AL = remainder = trailing digit (low)
  ; normally never use div or aam by a power of 2, only for code-size over speed.
     cmp   al, 10        ; 2B 4c  set CF according to digit <= 9
     sbb   al, 69h       ; 2B 4c  read CF, set CF and conditionally set AF
     das                 ; 1B 4c   magic, which happens to work
     xchg  dx, ax        ; 1B 3c    stash low digit in DL

     mov   al, dh        ; 2B 2c    get leading digit
     cmp   al, 10        ; 2B 4c
     sbb   al, 69h       ; 2B 4c    most significant (high) nibble as ASCII hex
     das                 ; 1B 4c
     
     mov   ah, 0Eh       ; 2B 3c BIOS teletype output (of AL), advancing cursor
     int   10h           ; 2B ?
     mov   al, dl        ; 2B 2c  ; get low digit back from DL   xchg  ax, dx breaks AH callnum
     int   10h           ; 2B
; size=23B

我想知道是否可以对来自其中一个输出的DL输入使用 int 21h/AH = 2 ?这将需要更改AH,但也许可以对第二个输出进行更改.不幸的是,DOS调用在AL上执行,将其设置为打印的字符.(例如,通过使用此int 10h调用).

I wonder if I could use int 21h / AH=2 with input from DL for one of the outputs? That would require changing AH, but could perhaps be done for the 2nd output. Unfortunately that DOS call steps on AL, setting it to the character printed. (e.g. from using this int 10h call).

相关:

这篇关于增加二进制的效率->8086的格雷码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆