英特尔x86汇编优化技术,用于将8位扩展为0或1的8个布尔字节 [英] Intel x86 assembly optimization techniques for expanding 8 bits to 8 boolean bytes of 0 or 1

查看:88
本文介绍了英特尔x86汇编优化技术,用于将8位扩展为0或1的8个布尔字节的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我花了相当长的时间学习汇编程序,并且试图将一些简单的程序\函数重写为它,以查看性能好处(如果有).我的主要开发工具是Delphi 2007,第一个示例将使用该语言,但也可以轻松将其翻译为其他语言.

I am learning assembler quite a while and I am trying to rewrite some simple procedures \ functions to it to see performance benefits (if any). My main development tool is Delphi 2007 and first examples will be in that language but they can be easily translated to other languages as well.

问题表示为:

我们给出了一个无符号字节值,其中八个位中的每个位代表屏幕一行中的一个像素.每个单个像素可以是实心(1)或透明(0).换句话说,我们将8个像素打包成一个字节. 我想将这些像素解压缩为八个字节的数组,以使最小的pixel(bit)降落在数组的最低索引下,依此类推.这是一个示例:

We have given an unsigned byte value in which each of the eight bits represents a pixel in one row of a screen. Each single pixel can be solid (1) or transparent (0). So in other words, we have 8 pixels packed in one byte value. I want to unpack those pixels into an eight byte array in the way that youngest pixel(bit) will land under the lowest index of the array and so on. Here is an example:

One byte value -----------> eight byte array

10011011 -----------------> [1][1][0][1][1][0][0][1]

Array index number ------->  0  1  2  3  4  5  6  7

下面,我介绍解决问题的五种方法.接下来,我将展示他们的时间比较以及如何测量这些时间.

Below I present five methods which are solving the problem. Next I will show their time comparison and how I did measure those times.

我的问题包括两个部分:

My questions consist of two parts:

我要问您有关方法DecodePixels4aDecodePixels4b详细答案.为什么方法4b4a慢一些?

I am asking you for detailed answer concerning methods DecodePixels4a and DecodePixels4b. Why method 4b is somewhat slower than the 4a?

例如,如果由于我的代码未正确对齐而变慢了,那么请告诉我给定方法中的哪些指令可以更好地对齐,以及如何做到这一点而不会破坏方法.

If for example it is slower because my code is not aligned correctly then show me which instructions in a given method could be better aligned and how to do this to not break the method.

我想看看理论背后的真实例子.请记住,我正在学习汇编语言,我想从您的答案中获得知识,这使我将来可以编写更好的优化代码.

I would like to see real examples behind the theory. Please bear in mind that I am learning assembly and I want to gain knowledge from your answers which allows me in the future to writing better optimized code.

您可以编写比DecodePixels4a更快的例程吗?如果是这样,请提供它并描述您已采取的优化步骤. 所谓更快的例程",是指在此处介绍的所有例程中,在您的测试环境中运行时间最短的例程.

Can you write faster routine than DecodePixels4a? If so, please present it and describe optimization steps that you have taken. By faster routine I mean routine that runs in the shortest period of time in your test environment among all the routines presented here.

允许使用所有英特尔家族处理器以及与之兼容的处理器.

All Intel family processors are allowed and those which are compatible with them.

下面您会发现我编写的例程:

Below you will find routines written by me:

procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
  i3: Integer;
begin
  DecPixels[0] := EncPixels and $01;
  for i3 := 1 to 7 do
  begin
    EncPixels := EncPixels shr 1;
    DecPixels[i3] := EncPixels and $01;
    //DecPixels[i3] := (EncPixels shr i3) and $01;  //this is even slower if you replace above 2 lines with it
  end;
end;


//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[1] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[2] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[3] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[4] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[5] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[6] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[7] := EncPixels and $01;
end;


procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    push ecx;
    mov bl, al;
    and bl, $01;
    mov [edx], bl;
    mov ecx, $00;
@@Decode:
    inc ecx;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + ecx], bl;
    cmp ecx, $07;
    jnz @@Decode;
    pop ecx;
    pop ebx;
    pop eax;
  end;
end;


//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    mov  [edx], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $01], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $02], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $03], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $04], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $05], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $06], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;


// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx], bl;        //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $01], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $02], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $03], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $04], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $05], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $06], bl;  //
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;

这是我如何测试它们:

program Test;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

type
  TDecodedPixels = array[0..7] of Byte;
var
  Pixels: TDecodedPixels;
  Freq, TimeStart, TimeEnd :Int64;
  Time1, Time2, Time3, Time4a, Time4b: Extended;
  i, i2: Integer;

begin
  if QueryPerformanceFrequency(Freq) then
  begin
    for i2 := 1 to 100 do
    begin
      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels1(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels2(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels3(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4a(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4b(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);

    end;
    Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms.    <- Delphi loop.');
    Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms.    <- Delphi unrolled loop.');
    Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms.    <- BASM loop.');
    Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms.    <- BASM unrolled loop.');
    Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms.    <- BASM unrolled loop instruction switch.');
  end;
  Readln;
end.

这是我的计算机上的结果(在Win32 XP上为Intel®Pentium®E2180):

Here are the results from my machine ( Intel® Pentium® E2180 on Win32 XP) :

Time1  : 1,68443549919493 ms.     <- Delphi loop.
Time2  : 1,33773024572211 ms.     <- Delphi unrolled loop.
Time3  : 1,37015271374424 ms.     <- BASM loop.
Time4a : 0,822916962526627 ms.    <- BASM unrolled loop.
Time4b : 0,862914462301607 ms.    <- BASM unrolled loop instruction switch.

结果非常稳定-我进行的每个测试之间的时间仅相差几个百分点.始终都是这样:Time1 > Time3 > Time 2 > Time4b > Time4a

The results are pretty stable - times vary only by few percent between each test I've made. And that was always true: Time1 > Time3 > Time 2 > Time4b > Time4a

因此,我认为Time4a和Time4b之间的不同取决于方法DecodePixels4b中的指令切换.有时是4%,有时是10%,但是4b总是比4a慢.

So I think that de difference between Time4a and Time4b depends of that instructions switch in the method DecodePixels4b. Sometimes it is 4% sometimes it is up to 10% but 4b is always slower than 4a.

我正在考虑另一种使用MMX指令一次将8个字节写入内存的方法,但是我不知道将字节解压缩到64位寄存器中的快速方法.

I was thinking about another method with usage of MMX instructions to write into memory eight bytes at one time, but I can't figure out fast way to unpack byte into the 64 bit register.

谢谢您的时间.

谢谢你们的宝贵意见.虽然我可以同时回答所有人,但是与现代CPU相比,不幸的是,我只有一个管道",并且一次只能执行一个指令答复" ;-) 因此,我将尝试在此处总结一些内容,并在您的答案下写一些其他评论.

Thank you guys for your valuable input. Whish I could answer all of you at the same time, unfortunately compared to the modern CPU's I have only one "pipe" and can execute only one instruction "reply" at the time ;-) So, I will try sum up some things over here and write additional comments under your answers.

首先,我想说的是,在发布我的问题之前,我想出了Wouter van Nifterick提出的解决方案,它实际上比我的汇编代码要慢得多. 因此,我决定不在此处发布该例程,但是您可能会看到,我在该例程的循环Delphi版本中也采用了相同的方法.有人在这里评论,因为它给了我更差的结果.

First of all, I wanted to say that before posting my question I came up with the solution presented by Wouter van Nifterick and it was actually way slower then my assembly code. So I've decided not to post that routine here, but you may see that I took the same approach also in my loop Delphi version of the routine. It is commented there because it was giving me worser results.

这对我来说是个谜.我再次使用Wouter和PhilS的例程运行代码,结果如下:

This is a mystery for me. I've run my code once again with Wouter's and PhilS's routines and here are the results:

Time1  : 1,66535493194387 ms.     <- Delphi loop.
Time2  : 1,29115785420688 ms.     <- Delphi unrolled loop.
Time3  : 1,33716934524107 ms.     <- BASM loop.
Time4a : 0,795041753757838 ms.    <- BASM unrolled loop.
Time4b : 0,843520166815013 ms.    <- BASM unrolled loop instruction switch.
Time5  : 1,49457681191307 ms.     <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,400587402866258 ms.    <- PhiS, table lookup Delphi
Time7  : 0,325472442519827 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,37350491544239 ms.     <- PhiS, table lookup BASM

看看Time5的结果,是不是很奇怪? 我猜我有不同的Delphi版本,因为我生成的汇编代码与Wouter提供的代码不同.

Look at the Time5 result, quite strange isn't it? I guess I have different Delphi version, since my generated assembly code differs from that provided by Wouter.

第二项主要修改:

我知道为什么常规5在我的机器上比较慢.我在编译器选项中选中了范围检查"和溢出检查".我已将assembler指令添加到例程9中,以查看是否有帮助.看起来,使用此指令的汇编过程与Delphi内联变体一样好,甚至更好.

I know why routine 5 was slower on my machnie. I had checked "Range checking" and "Overflow checking" in my compiler options. I've added assembler directive to routine 9 to see if it helps. It seems that with this directive assembly procedure is as good as Delphi inline variant or even slightly better.

这是最终结果:

Time1  : 1,22508325749317 ms.     <- Delphi loop.
Time2  : 1,33004145373084 ms.     <- Delphi unrolled loop.
Time3  : 1,1473583622526 ms.      <- BASM loop.
Time4a : 0,77322594033463 ms.     <- BASM unrolled loop.
Time4b : 0,846033593023372 ms.    <- BASM unrolled loop instruction switch.
Time5  : 0,688689382044384 ms.    <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,503233741036693 ms.    <- PhiS, table lookup Delphi
Time7  : 0,385254722925063 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,432993919452751 ms.    <- PhiS, table lookup BASM
Time9  : 0,362680491244212 ms.    <- PhiS, table lookup BASM with assembler directive

第三项主要修改:

@Pascal Cuoq和@j_random_hacker认为,例程4a4b5之间的执行时间差异是由数据依赖性引起的.但是,基于我所做的进一步测试,我不得不不同意这种观点.

In opinion @Pascal Cuoq and @j_random_hacker the difference in execution times between routines 4a, 4b and 5 is caused by the data dependency. However I have to disagree with that opinion basing on the further tests that I've made.

我还发明了基于4a的新例程4c.在这里:

I've also invented new routine 4c based on 4a. Here it is:

procedure DecodePixels4c(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push ebx;
    mov bl, al;
    and bl, 1;
    mov [edx], bl;
    mov bl, al;
    shr bl, 1;
    and bl, 1;
    mov [edx + $01], bl;
    mov bl, al;
    shr bl, 2;
    and bl, 1;
    mov [edx + $02], bl;
    mov bl, al;
    shr bl, 3;
    and bl, 1;
    mov [edx + $03], bl;
    mov bl, al;
    shr bl, 4;
    and bl, 1;
    mov [edx + $04], bl;
    mov bl, al;
    shr bl, 5;
    and bl, 1;
    mov [edx + $05], bl;
    mov bl, al;
    shr bl, 6;
    and bl, 1;
    mov [edx + $06], bl;
    shr al, 7;
    and al, 1;
    mov [edx + $07], al;
    pop ebx;
  end;
end;

我会说这取决于数据.

这是测试和结果.我已经进行了四项测试,以确保没有发生意外. 我还为GJ提出的例程添加了新的时间(Time10a,Time10b).

And here are the tests and results. I've made four tests to make sure there is no accident. I've also added new times for the routines proposed by GJ (Time10a, Time10b).

          Test1  Test2  Test3  Test4

Time1   : 1,211  1,210  1,220  1,213
Time2   : 1,280  1,258  1,253  1,332
Time3   : 1,129  1,138  1,130  1,160

Time4a  : 0,690  0,682  0,617  0,635
Time4b  : 0,707  0,698  0,706  0,659
Time4c  : 0,679  0,685  0,626  0,625
Time5   : 0,715  0,682  0,686  0,679

Time6   : 0,490  0,485  0,522  0,514
Time7   : 0,323  0,333  0,336  0,318
Time8   : 0,407  0,403  0,373  0,354
Time9   : 0,352  0,378  0,355  0,355
Time10a : 1,823  1,812  1,807  1,813
Time10b : 1,113  1,120  1,115  1,118
Time10c : 0,652  0,630  0,653  0,633
Time10d : 0,156  0,155  0,172  0,160  <-- current winner!

您可能会看到4a4b4c5的结果非常接近. 这是为什么?因为我已经从中删除了 4a,4b(4c已经没有)两个指令:push eaxpop eax.因为我知道我不会在代码中的任何其他地方使用eax下的值,所以我不必保留它. 现在我的代码只有一对push/pop,如例行程序5所示. 例程5保留eax的值,因为它首先在ecx下复制了它,但并不保留ecx.

As you may see the results of 4a, 4b, 4c and 5 are very close to each other. Why is that? Because I've removed from 4a, 4b (4c already doesn't have it) two instructions: push eax and pop eax. Since I know I wont use anywhere else in my code the value under eax I do not have to prereserve it. Now my code has only one pair of push/pop so as the routine 5. Routine 5 prereserves value of eax beacause it firstly make copy of it under ecx but it deson't prereserve ecx.

所以我的结论是:5、4a和4b在执行时间上的差异(在第三次编辑之前)与数据依赖性无关,而是由另外一对推/弹出指令引起的

So my conclusion is that: the difference in time execution of 5 and 4a and 4b (before the third edit) didn't concern data dependecny but was caused by additional pair of push / pop instructions.

我对您的评论非常感兴趣.

I am very interested in your comments.

几天后,GJ发明了比PhiS更快的例程(时间10d).干得好GJ!

After a few days GJ invented even faster routine (Time 10d) than PhiS's. Nice work GJ!

推荐答案

您的asm代码相对较慢,因为使用堆栈末端将8次写入内存. 检查这个...

Your asm code is relativity slow because use stack end write 8 times to memory. Check this one...

procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;

也许比使用查找表的代码还要快!

Maybe is even faster than code with lookup table!

改进版本:

procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;

版本3:

procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;

版本4:

const Uint32DecPix : array [0..15] of cardinal = (
  $00000000, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;

这篇关于英特尔x86汇编优化技术,用于将8位扩展为0或1的8个布尔字节的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆