在样本问题英特尔x86汇编优化技术 [英] Intel x86 assembly optimization techniques in a sample problem

查看:192
本文介绍了在样本问题英特尔x86汇编优化技术的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我学习汇编很长一段时间,我试图重写一些简单的程序\\功能,它性能优势(如果有的话)。我的主要开发工具是德尔福2007年第一例将这种语言,但他们可以很容易地翻译成其他语言。

I am learning assembler quite a while and I am trying to rewrite some simple procedures \ functions to it to see performance benefits (if any). My main development tool is Delphi 2007 and first examples will be in that language but they can be easily translated to other languages as well.

问题的状态为:

我们已经给一个无符号字节值,其中每个八位的重新presents中的画面的一个行的像素。每个单个像素可以是固体(1)或透明的(0)。因此,换句话说,我们有装在一个字节值8个像素。
我想这些像素解压到在最小的像素(位)将在阵列等指数最低下土地的方式八个字节数组。下面是一个例子:

We have given an unsigned byte value in which each of the eight bits represents a pixel in one row of a screen. Each single pixel can be solid (1) or transparent (0). So in other words, we have 8 pixels packed in one byte value. I want to unpack those pixels into an eight byte array in the way that youngest pixel(bit) will land under the lowest index of the array and so on. Here is an example:

One byte value -----------> eight byte array

10011011 -----------------> [1][1][0][1][1][0][0][1]

Array index number ------->  0  1  2  3  4  5  6  7

下面我present五种方法这是解决问题。接下来,我将展示他们的时间比较,我怎么衡量这些次。

Below I present five methods which are solving the problem. Next I will show their time comparison and how I did measure those times.

我的问题由两部分组成:

My questions consist of two parts:

我要问你的详细关于答案的方法德codePixels4a 德codePixels4b 。为什么方法 4B 比有点慢了 4A

I am asking you for detailed answer concerning methods DecodePixels4a and DecodePixels4b. Why method 4b is somewhat slower than the 4a?

例如,如果它比较慢,因为我的code未正确对齐,然后告诉我这在一个给定的方法的指令可以更好地对齐,如何做到这一点不破的方法。

If for example it is slower because my code is not aligned correctly then show me which instructions in a given method could be better aligned and how to do this to not break the method.

我想看到背后的理论实际例子。请记住,我学习组装,我想从你的答案获得知识,让我在今后写出更好的优化,code。

I would like to see real examples behind the theory. Please bear in mind that I am learning assembly and I want to gain knowledge from your answers which allows me in the future to writing better optimized code.

您可以写比德codePixels4a 常规速度更快?如果是这样,请present,并描述已经采取了优化措施。
通过的快速进行日常的我的意思是在最短的时间内间这里psented所有例程$ P $运行在您的测试环境程序。

Can you write faster routine than DecodePixels4a? If so, please present it and describe optimization steps that you have taken. By faster routine I mean routine that runs in the shortest period of time in your test environment among all the routines presented here.

所有Intel系列处理器是允许的,那些与其兼容。

All Intel family processors are allowed and those which are compatible with them.

下面,你会发现我写的程序:

Below you will find routines written by me:

procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
  i3: Integer;
begin
  DecPixels[0] := EncPixels and $01;
  for i3 := 1 to 7 do
  begin
    EncPixels := EncPixels shr 1;
    DecPixels[i3] := EncPixels and $01;
    //DecPixels[i3] := (EncPixels shr i3) and $01;  //this is even slower if you replace above 2 lines with it
  end;
end;


//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[1] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[2] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[3] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[4] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[5] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[6] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[7] := EncPixels and $01;
end;


procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    push ecx;
    mov bl, al;
    and bl, $01;
    mov [edx], bl;
    mov ecx, $00;
@@Decode:
    inc ecx;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + ecx], bl;
    cmp ecx, $07;
    jnz @@Decode;
    pop ecx;
    pop ebx;
    pop eax;
  end;
end;


//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    mov  [edx], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $01], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $02], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $03], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $04], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $05], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $06], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;


// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx], bl;        //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $01], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $02], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $03], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $04], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $05], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $06], bl;  //
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;

这里是我如何对其进行测试:

And here is how do I test them:

program Test;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

type
  TDecodedPixels = array[0..7] of Byte;
var
  Pixels: TDecodedPixels;
  Freq, TimeStart, TimeEnd :Int64;
  Time1, Time2, Time3, Time4a, Time4b: Extended;
  i, i2: Integer;

begin
  if QueryPerformanceFrequency(Freq) then
  begin
    for i2 := 1 to 100 do
    begin
      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels1(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels2(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels3(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4a(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4b(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);

    end;
    Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms.    <- Delphi loop.');
    Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms.    <- Delphi unrolled loop.');
    Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms.    <- BASM loop.');
    Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms.    <- BASM unrolled loop.');
    Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms.    <- BASM unrolled loop instruction switch.');
  end;
  Readln;
end.

下面是从我的机器(英特尔®奔腾®E2180在Win32 XP)的结果:

Here are the results from my machine ( Intel® Pentium® E2180 on Win32 XP) :

Time1  : 1,68443549919493 ms.     <- Delphi loop.
Time2  : 1,33773024572211 ms.     <- Delphi unrolled loop.
Time3  : 1,37015271374424 ms.     <- BASM loop.
Time4a : 0,822916962526627 ms.    <- BASM unrolled loop.
Time4b : 0,862914462301607 ms.    <- BASM unrolled loop instruction switch.

的结果是pretty稳定 - 时间只有每个测试我做的几个百分点的变化。这是总是正确的:时间1&GT;时间3&GT;时间2 - ; Time4b&GT; Time4a

所以,我认为Time4a和Time4b德之间的差别取决于那说明交换机的方法德codePixels4b 。有时是4%,有时高达10%,但 4B 总是比 4A 慢。

So I think that de difference between Time4a and Time4b depends of that instructions switch in the method DecodePixels4b. Sometimes it is 4% sometimes it is up to 10% but 4b is always slower than 4a.

我在想用的MMX指令使用另一种方法来写入到存储器的8个字节在同一时间,但我想不出快速的方法来解压缩字节到64位寄存器。

I was thinking about another method with usage of MMX instructions to write into memory eight bytes at one time, but I can't figure out fast way to unpack byte into the 64 bit register.

感谢您的时间。


感谢你们的宝贵意见。韦思我可以在同一时间回答大家,可惜比起现代CPU的我只有一个管,并在当时只能执行一条指令复函;-)
所以,我会尝试总结出一些东西在这里,并在你的答案写补充意见。

Thank you guys for your valuable input. Whish I could answer all of you at the same time, unfortunately compared to the modern CPU's I have only one "pipe" and can execute only one instruction "reply" at the time ;-) So, I will try sum up some things over here and write additional comments under your answers.

首先,我想说,发布我的问题之前,我想出了由沃特面包车Nifterick psented解决方案$ P $,它实际上是的方法要慢,然后我组装code 。
所以,我决定不张贴在这里,常规,但你可能会看到我采取了同样的做法也在我的循环常规的德尔福版本。它被注释那里,因为这是给我沃瑟结果。

First of all, I wanted to say that before posting my question I came up with the solution presented by Wouter van Nifterick and it was actually way slower then my assembly code. So I've decided not to post that routine here, but you may see that I took the same approach also in my loop Delphi version of the routine. It is commented there because it was giving me worser results.

这对我来说是一个谜。我曾与沃特的和菲尔斯的程序和运行在这里我的code再次是结果:

This is a mystery for me. I've run my code once again with Wouter's and PhilS's routines and here are the results:

Time1  : 1,66535493194387 ms.     <- Delphi loop.
Time2  : 1,29115785420688 ms.     <- Delphi unrolled loop.
Time3  : 1,33716934524107 ms.     <- BASM loop.
Time4a : 0,795041753757838 ms.    <- BASM unrolled loop.
Time4b : 0,843520166815013 ms.    <- BASM unrolled loop instruction switch.
Time5  : 1,49457681191307 ms.     <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,400587402866258 ms.    <- PhiS, table lookup Delphi
Time7  : 0,325472442519827 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,37350491544239 ms.     <- PhiS, table lookup BASM

看那Time5结果,很奇怪是不是?
我想我有不同的Delphi版本,因为我生成的汇编code从所提供的沃特不同。

Look at the Time5 result, quite strange isn't it? I guess I have different Delphi version, since my generated assembly code differs from that provided by Wouter.

第二大编辑:


Second major edit:


我知道为什么常规 5 在我machnie慢。我测试过范围检查,并在我的编译器选项溢出检查。我已经添加了汇编程序指令例行 9 来看看是否有帮助。看来,这项指令组装过程是德尔福变种内联,甚至略胜一筹好。

I know why routine 5 was slower on my machnie. I had checked "Range checking" and "Overflow checking" in my compiler options. I've added assembler directive to routine 9 to see if it helps. It seems that with this directive assembly procedure is as good as Delphi inline variant or even slightly better.

下面是最终的结果:

Time1  : 1,22508325749317 ms.     <- Delphi loop.
Time2  : 1,33004145373084 ms.     <- Delphi unrolled loop.
Time3  : 1,1473583622526 ms.      <- BASM loop.
Time4a : 0,77322594033463 ms.     <- BASM unrolled loop.
Time4b : 0,846033593023372 ms.    <- BASM unrolled loop instruction switch.
Time5  : 0,688689382044384 ms.    <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,503233741036693 ms.    <- PhiS, table lookup Delphi
Time7  : 0,385254722925063 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,432993919452751 ms.    <- PhiS, table lookup BASM
Time9  : 0,362680491244212 ms.    <- PhiS, table lookup BASM with assembler directive

第三大编辑:


在舆论@Pascal Cuoq和@j_random_hacker在执行时间例程间 4A 4B 区别和 5 由数据依赖造成的。不过我对此的意见基础上,我所做的进一步测试不同意。

In opinion @Pascal Cuoq and @j_random_hacker the difference in execution times between routines 4a, 4b and 5 is caused by the data dependency. However I have to disagree with that opinion basing on the further tests that I've made.

我也根据 4A 发明新的常规 4C 。在这里,它是:

I've also invented new routine 4c based on 4a. Here it is:

procedure DecodePixels4c(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push ebx;
    mov bl, al;
    and bl, 1;
    mov [edx], bl;
    mov bl, al;
    shr bl, 1;
    and bl, 1;
    mov [edx + $01], bl;
    mov bl, al;
    shr bl, 2;
    and bl, 1;
    mov [edx + $02], bl;
    mov bl, al;
    shr bl, 3;
    and bl, 1;
    mov [edx + $03], bl;
    mov bl, al;
    shr bl, 4;
    and bl, 1;
    mov [edx + $04], bl;
    mov bl, al;
    shr bl, 5;
    and bl, 1;
    mov [edx + $05], bl;
    mov bl, al;
    shr bl, 6;
    and bl, 1;
    mov [edx + $06], bl;
    shr al, 7;
    and al, 1;
    mov [edx + $07], al;
    pop ebx;
  end;
end;

我会说这是pretty数据相关。

I would say it is pretty data dependent.

和这里的测试和结果。我做了四个测试,以确保没有意外。
我还添加了新的时代由GJ(Time10a,Time10b)。

And here are the tests and results. I've made four tests to make sure there is no accident. I've also added new times for the routines proposed by GJ (Time10a, Time10b).

          Test1  Test2  Test3  Test4

Time1   : 1,211  1,210  1,220  1,213
Time2   : 1,280  1,258  1,253  1,332
Time3   : 1,129  1,138  1,130  1,160

Time4a  : 0,690  0,682  0,617  0,635
Time4b  : 0,707  0,698  0,706  0,659
Time4c  : 0,679  0,685  0,626  0,625
Time5   : 0,715  0,682  0,686  0,679

Time6   : 0,490  0,485  0,522  0,514
Time7   : 0,323  0,333  0,336  0,318
Time8   : 0,407  0,403  0,373  0,354
Time9   : 0,352  0,378  0,355  0,355
Time10a : 1,823  1,812  1,807  1,813
Time10b : 1,113  1,120  1,115  1,118
Time10c : 0,652  0,630  0,653  0,633
Time10d : 0,156  0,155  0,172  0,160  <-- current winner!

正如你可能会看到 4A 4B 4C 5 非常接近对方。
这是为什么?因为我的删除从4A,4B(4C已经没有它)两条指令:推EAX 流行EAX 。因为我知道我不会用其他地方在我的code值下的EAX我没有$ P $保存它。
现在我的code的只有一对推/流行这样的日常5。
常规5 $ P $保留EAX价值东阳它首先做它下副本ECX但deson't $ P $保存ECX。

As you may see the results of 4a, 4b, 4c and 5 are very close to each other. Why is that? Because I've removed from 4a, 4b (4c already doesn't have it) two instructions: push eax and pop eax. Since I know I wont use anywhere else in my code the value under eax I do not have to prereserve it. Now my code has only one pair of push/pop so as the routine 5. Routine 5 prereserves value of eax beacause it firstly make copy of it under ecx but it deson't prereserve ecx.

所以,我的结论是:在5个时间执行,4A和4B(第三编辑之前)的的区别是没有关系的数据dependecny,但被另外一对推压/ pop指令引起

So my conclusion is that: the difference in time execution of 5 and 4a and 4b (before the third edit) didn't concern data dependecny but was caused by additional pair of push / pop instructions.

我对你的意见很感兴趣。

I am very interested in your comments.

几天后,GJ发明更快比常规PHIS的(时间10D)。不错的工作GJ!

After a few days GJ invented even faster routine (Time 10d) than PhiS's. Nice work GJ!

推荐答案

您ASM code是相对缓慢的,因为使用的堆栈年底写8次到内存中。
检查这一个...

Your asm code is relativity slow because use stack end write 8 times to memory. Check this one...

procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;

也许是速度甚至比code。与查找表!

Maybe is even faster than code with lookup table!

改进版:

procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;

第3版:

procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;

第4版:

const Uint32DecPix : array [0..15] of cardinal = (
  $00000000, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;

这篇关于在样本问题英特尔x86汇编优化技术的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆