英特尔x86组装优化技术在示例问题中 [英] Intel x86 assembly optimization techniques in a sample problem

查看：206 发布时间：2017/4/24 3:12:16 delphi optimization assembly x86 basm

本文介绍了英特尔x86组装优化技术在示例问题中的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在学习汇编器一段时间，我试图重写一些简单的程序\函数来查看性能优势（如果有的话）。我的主要开发工具是Delphi 2007，第一个例子将是这种语言，但是可以轻松地将其翻译成其他语言。

问题描述如下：

我们给出了一个无符号字节值，其中八位中的每一个表示屏幕的一行中的像素。每个单个像素可以是固体（1）或透明（0）。所以换句话说，我们有8个像素打包在一个字节值。
我想将这些像素解压缩成八字节数组，最小像素（位）将降到数组的最低索引之下，依此类推。这是一个例子：

 一个字节值----------->八字节数组
 
 10011011 -----------------> [1] [1] [0] [1] [1] [0] [0] [1] 
 
数组索引号-------> 0 1 2 3 4 5 6 7

下面我介绍了解决问题的五种方法。接下来我会显示他们的时间比较，以及我如何衡量这些时间。

我的问题包括两部分：

< h2> 1。

我要求您提供有关方法的详细答案和 DecodePixels4b 。为什么方法 4b 比 4a 更慢？

 
 
 如果比较慢，因为我的代码没有正确对齐，那么显示给定方法中的哪些指令可以更好地对齐，如何做到这一点不会破坏方法。
 
 
我想看看理论背后的真实例子。请记住，我正在学习大会，我想从你的答案中获得知识，这让我以后能够编写更好的优化代码。
 
 
  2。
 
 
 你可以写出比 DecodePixels4a 更快的例程吗？如果是，请提出并描述您采取的优化步骤。 
通过更快的例程我的意思是在您所在的测试环境中，在最短的时间内运行的例程。
 
 
 所有英特尔家庭处理器都是允许的，并且与它们兼容。
 
 
 下面你会发现我写的例程：
  procedure DecodePixels1（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 var 
 i3：整数; 
 begin 
 DecPixels [0]：= EncPixels和$ 01; 
 for i3：= 1到7 do 
 begin 
 EncPixels：= EncPixels shr 1; 
 DecPixels [i3]：= EncPixels和$ 01; 
 // DecPixels [i3]：=（EncPixels shr i3）和$ 01; //如果用
 end替换2行以上，这会更慢; 
结束
 
 
 //让我们展开循环，看看它是否会更快。 
 procedure DecodePixels2（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 begin 
 DecPixels [0]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [1]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [2]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [3]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [4]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [5]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [6]：= EncPixels和$ 01; 
 EncPixels：= EncPixels shr 1; 
 DecPixels [7]：= EncPixels和$ 01; 
结束
 
 
程序DecodePixels3（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 begin 
 asm 
 push eax; 
推ebx; 
 push ecx; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx]，bl; 
 mov ecx，$ 00; 
 @@解码：
 inc ecx; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + ecx]，bl; 
 cmp ecx，$ 07; 
 jnz @@解码; 
 pop ecx; 
 pop ebx; 
 pop eax; 
结束
结束
 
 
 //展开的汇编循环
程序DecodePixels4a（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 begin 
 asm 
 push eax; 
推ebx; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 01]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 02]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 03]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 04]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 05]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 06]，bl; 
 shr al，$ 01; 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 07]，bl; 
 pop ebx; 
 pop eax; 
结束
结束
 
 
 //与4a相比，只有在切换两个指令（但是7次）时才有差异
 procedure DecodePixels4b（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 begin 
 asm 
 push eax; 
推ebx; 
 mov bl，al; 
和bl，$ 01; 
 shr al，$ 01; // 
 mov [edx]，bl; // 
 mov bl，al; 
和bl，$ 01; 
 shr al，$ 01; // 
 mov [edx + $ 01]，bl; // 
 mov bl，al; 
和bl，$ 01; 
 shr al，$ 01; // 
 mov [edx + $ 02]，bl; // 
 mov bl，al; 
和bl，$ 01; 
 shr al，$ 01; // 
 mov [edx + $ 03]，bl; // 
 mov bl，al; 
和b1，$ 01; 
 shr al，$ 01; // 
 mov [edx + $ 04]，bl; // 
 mov bl，al; 
和bl，$ 01; 
 shr al，$ 01; // 
 mov [edx + $ 05]，bl; // 
 mov bl，al; 
和bl，$ 01; 
 shr al，$ 01; // 
 mov [edx + $ 06]，bl; // 
 mov bl，al; 
和bl，$ 01; 
 mov [edx + $ 07]，bl; 
 pop ebx; 
 pop eax; 
结束
结束
  
这里是如何测试他们的：
 程序测试; 
 
 {$ APPTYPE CONSOLE} 
 
使用
 SysUtils，Windows; 
 
 type 
 TDecodedPixels = Byte的数组[0..7] 
 var 
像素：TDecodedPixels; 
 Freq，TimeStart，TimeEnd：Int64; 
 Time1，Time2，Time3，Time4a，Time4b：Extended; 
 i，i2：整数; 
 
 begin 
如果QueryPerformanceFrequency（Freq）然后
开始
为i2：= 1到100 do 
 begin 
 QueryPerformanceCounter（TimeStart）; 
 for i：= 1 to 100000 do 
 DecodePixels1（155，Pixels）; 
 QueryPerformanceCounter（TimeEnd）; 
 Time1：= Time1 +（（TimeEnd  -  TimeStart）/ Freq * 1000）; 
 
 QueryPerformanceCounter（TimeStart）; 
 for i：= 1 to 100000 do 
 DecodePixels2（155，Pixels）; 
 QueryPerformanceCounter（TimeEnd）; 
 Time2：= Time2 +（（TimeEnd  -  TimeStart）/ Freq * 1000）; 
 
 QueryPerformanceCounter（TimeStart）; 
 for i：= 1 to 100000 do 
 DecodePixels3（155，Pixels）; 
 QueryPerformanceCounter（TimeEnd）; 
 Time3：= Time3 +（（TimeEnd  -  TimeStart）/ Freq * 1000）; 
 
 QueryPerformanceCounter（TimeStart）; 
 for i：= 1 to 100000 do 
 DecodePixels4a（155，Pixels）; 
 QueryPerformanceCounter（TimeEnd）; 
 Time4a：= Time4a +（（TimeEnd  -  TimeStart）/ Freq * 1000）; 
 
 QueryPerformanceCounter（TimeStart）; 
 for i：= 1到100000 do 
 DecodePixels4b（155，Pixels）; 
 QueryPerformanceCounter（TimeEnd）; 
 Time4b：= Time4b +（（TimeEnd  -  TimeStart）/ Freq * 1000）; 
 
 end; 
 Writeln（'Time1：'+ FloatToStr（Time1 / 100）+'ms。<  -  Delphi循环。 
 Writeln（'Time2：'+ FloatToStr（Time2 / 100）+'ms。<  -  Delphi展开循环。 
 Writeln（'Time3：'+ FloatToStr（Time3 / 100）+'ms。<  -  BASM循环。 
 Writeln（'Time4a：'+ FloatToStr（Time4a / 100）+'ms。<  -  BASM展开循环。 
 Writeln（'Time4b：'+ FloatToStr（Time4b / 100）+'ms。<  -   -  BASM展开循环指令开关。 
结束
 Readln; 
结束。 
  
以下是我的机器（Win32 XP上的英特尔®奔腾®E2180）的结果：
  Time1：1,68443549919493 ms。 <  - 德尔福循环。 
时间2：1,33773024572211 ms。 <  -  Delphi展开循环。 
时间3：1,37015271374424 ms。 <  -  BASM循环。 
 Time4a：0,822916962526627 ms。 <  -  BASM展开循环。 
 Time4b：0,862914462301607 ms。 <  -  BASM展开循环指令开关。 
  
结果相当稳定 - 每次测试之间的时间变化只有很小的百分比。这一直是真的： Time1> Time3>时间2> Time4b> Time4a  
 
 
 所以我认为在Time4a和Time4b之间的区别取决于该指令切换方法 DecodePixels4b 。有时候是4％，有时最多可达10％，但 4b 始终比 4a 更慢。
 
 
 我正在考虑使用MMX指令一次写入内存八个字节的另一种方法，但是我无法找出将字节解压缩到64位寄存器的快速方法。 
 
 
 感谢您的时间。
 
 
 
 
 
 谢谢你们为您的宝贵意见。我可以在同一时间回答所有的人，不幸的是，与现代CPU相比，我只有一个管道，当时只能执行一个指令回复;-) 
所以，我会尝试总和一些事情在这里，并在您的答案下写入额外的意见。
 
 
 首先，我想说，在发布我的问题之前，我想出了Wouter提出的解决方案van Nifterick，而实际上方式较慢然后是我的汇编代码。 
所以我决定不在这里发布这个例程，但你可能会看到我也采用了相同的方法，在我的循环Delphi版本的例程。它在那里被评论，因为它给我更糟的结果。 
 
 
 这对我来说是一个谜。我再次使用Wouter和PhilS的例程运行我的代码，结果如下：
  Time1：1,66535493194387 ms。 <  - 德尔福循环。 
时间2：1,29115785420688 ms。 <  -  Delphi展开循环。 
时间3：1,33716934524107 ms。 <  -  BASM循环。 
 Time4a：0,795041753757838 ms。 <  -  BASM展开循环。 
 Time4b：0,843520166815013 ms。 <  -  BASM展开循环指令开关。 
时间5：1,49457681191307 ms。 <  -  Wouter van Nifterick，Delphi展开
 Time6：0,400587402866258 ms。 <  -  PhiS，表查找Delphi 
 Time7：0,325472442519827 ms。 <  -  PhiS，表查找Delphi inline 
 Time8：0,37350491544239 ms。 <  -  PhiS，表查找BASM 
  
看看Time5的结果，相当奇怪不是吗？ 
我想我有不同的Delphi版本，因为我生成的汇编代码与Wouter提供的汇编代码不同。
 
 
  第二个主要修改： 
 
 
 
 
 
 我知道为什么例行 5 在我的机器上较慢。我在我的编译器选项中检查了范围检查和溢出检查。我已经将汇编程序指令添加到例程 9 中，以查看它是否有帮助。看来，这个指令汇编程序与Delphi内联变体一样好，甚至更好一些。 
 
 
 以下是最终结果：
  Time1：1,22508325749317女士。 <  - 德尔福循环。 
时间2：1,33004145373084 ms。 <  -  Delphi展开循环。 
时间3：1,1473583622526 ms。 <  -  BASM循环。 
 Time4a：0,77322594033463 ms。 <  -  BASM展开循环。 
 Time4b：0,846033593023372 ms。 <  -  BASM展开循环指令开关。 
 Time5：0,688689382044384 ms。 <  -  Wouter van Nifterick，Delphi展开
 Time6：0,503233741036693 ms。 <  -  PhiS，表查找Delphi 
 Time7：0,385254722925063 ms。 <  -  PhiS，表查找Delphi inline 
 Time8：0,432993919452751 ms。 <  -  PhiS，表查找BASM 
 Time9：0,362680491244212 ms。 <  -  PhiS，表查找BASM与汇编器指令
  
 第三主要编辑： 
 
 
 
 
 
 在意见@Pascal Cuoq和@j_random_hacker之间执行时间之间的差异 4a ， 4b 和 5 是由数据依赖性引起的。然而，我不得不基于我进一步的测试，这个意见。
 
 
 我还发明了新的例程 4c 基于 4a 。这里是：
  procedure DecodePixels4c（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 begin 
 asm 
 push ebx; 
 mov bl，al; 
和bl，1; 
 mov [edx]，bl; 
 mov bl，al; 
 shr bl，1; 
和bl，1; 
 mov [edx + $ 01]，bl; 
 mov bl，al; 
 shr bl，2; 
和bl，1; 
 mov [edx + $ 02]，bl; 
 mov bl，al; 
 shr bl，3; 
和bl，1; 
 mov [edx + $ 03]，bl; 
 mov bl，al; 
 shr bl，4; 
和bl，1; 
 mov [edx + $ 04]，bl; 
 mov bl，al; 
 shr bl，5; 
和bl，1; 
 mov [edx + $ 05]，bl; 
 mov bl，al; 
 shr bl，6; 
和bl，1; 
 mov [edx + $ 06]，bl; 
 shr al，7; 
和al，1; 
 mov [edx + $ 07]，al; 
 pop ebx; 
结束
结束
  
我会说这是很依赖数据。
 
 
 这里是测试和结果。我做了四个测试，以确保没有意外。 
我还为GJ（Time10a，Time10b）提出的例程添加了新的时间。
  Test1 Test2 Test3 Test4 
 
时间1：1,211 1,210 1,220 1,213 
时间2：1,280 1,258 1,253 1,332 
时间3：1,129 1,138 1,130 1,160 
 
 Time4a：0,690 0,682 0,617 0,635 
 Time4b：0,707 0,698 0,706 0,659 
时间4c：0,679 0,685 0,626 0,625 
时间5：0,715 0,682 0,686 0,679 
 
时间6：0,490 0,485 0,522 0,514 
时间7：0,323 0,333 0,336 0,318 
时间8：0,407 0,403 0,373 0,354 
时间9：0,352 0,378 0,355 0,355 
时间10a：1,823 1,812 1,807 1,813 
时间10b：1,113 1,120 1,115 1,118 
时间10c：0,652 0,630 0,653 0,633 
时间10d：0,156 0,155 0,172 0,160<  - 当前胜者！ 
  
您可能会看到 4a  ， 4b ， 4c 和 5 非常接近每个其他。 
为什么？因为我已经从4a中删除了 ，4b（4c已经没有）两个说明： push eax 和流行eax 。因为我知道我不会在我的代码中的任何地方使用eax下的价值，我不必预先保留。 
现在我的代码只有一对push / pop，如例程5. 
例程5预先确定eax的值，它首先在ecx下复制它，但不能预先保留ecx。所以我的结论是：5和4a和4b（第三编辑之前）的时间执行差异不涉及数据依赖，而是由额外的
 
 
 我对您的意见非常感兴趣。
 
 
 之后几天GJ比PhiS发明了更快的时间（时间10d）。好的工作GJ！
解决方案
您的asm代码是相对较慢的，因为使用堆栈结束写入8次内存。 
检查这一个... 
  procedure DecodePixels（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 asm 
 xor ecx，ecx 
 add al，al 
 rcl ecx，8 
 add al，al 
 rcl ecx，8 
 add al 
 rcl ecx，8 
 add al，al 
 rcl ecx，1 
 mov [DecPixels + 4]，ecx 
 xor ecx，ecx 
 add al，al 
 rcl ecx，8 
 add al，al 
 rcl ecx，8 
 add al，al 
 rcl ecx，8 
 add al 
 rcl ecx，1 
 mov [DecPixels]，ecx 
 end; 
  
也许比查找表的代码要快！
 
 
 改进版本：
  procedure DecodePixelsI（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 asm 
 mov ecx，0 //更快：xor ecx，ecx 
添加al，al 
 rcl ch，1 
 add al，al 
 rcl cl，1 
 ror ecx，16 
 add al，al 
 rcl ch，1 
 add al，al 
 rcl cl，1 
 mov [ DecPixels + 4]，ecx 
 mov ecx，0 //快于：xor ecx，ecx 
 add al，al 
 rcl ch，1 
 add al，al 
 rcl cl，1 
 ror ecx，16 
 add al，al 
 rcl ch，1 
 add al，al 
 rcl cl，1 
 mov [DecPixels]，ecx 
 end; 
  
版本3：
  procedure DecodePixelsX（EncPixels：Byte; var DecPixels：TDecodedPixels）; 
 asm 
 add al，al 
 setc byte ptr [DecPixels + 7] 
 add al，al 
 setc byte ptr [DecPixels + 6] 
 add al，al 
 setc byte ptr [DecPixels + 5] 
 add al，al 
 setc byte ptr [DecPixels + 4] 
 add al，al 
 setc byte ptr [DecPixels + 3] 
 add al，al 
 setc byte ptr [DecPixels + 2] 
 add al，al 
 setc byte ptr [DecPixels + 1] 
 setnz byte ptr [DecPixels] 
 end; 
  
版本4：
 （
 $ 00000000，$ 00000001，$ 00000100，$ 00000101，
 $ 00010000，$ 00010001，$ 00010100，$ 00010101，
 $ 01000000  const Uint32DecPix：array [0..15] $ 01000001，$ 01000100，$ 01000101，
 $ 01010000，$ 01010001，$ 01010100，$ 01010101 
）; 
 
程序DecodePixelsY（EncPixels：byte; var DecPixels：TDecodedPixels）;一致; 
 begin 
 pcardinal（@DecPixels）^：= Uint32DecPix [EncPixels和$ 0F]; 
 pcardinal（cardinal（@DecPixels）+ 4）^：= Uint32DecPix [（EncPixels和$ F0）shr 4]; 
结束
  
 
I am learning assembler quite a while and I am trying to rewrite some simple procedures \ functions to it to see performance benefits (if any). My main development tool is Delphi 2007 and first examples will be in that language but they can be easily translated to other languages as well.

The problem states as:

We have given an unsigned byte value in which each of the eight bits represents a pixel in one row of a screen. Each single pixel can be solid (1) or transparent (0). So in other words, we have 8 pixels packed in one byte value.
I want to unpack those pixels into an eight byte array in the way that youngest pixel(bit) will land under the lowest index of the array and so on. Here is an example:
One byte value -----------> eight byte array

10011011 -----------------> [1][1][0][1][1][0][0][1]

Array index number ------->  0  1  2  3  4  5  6  7
Below I present five methods which are solving the problem. Next I will show their time comparison and how I did measure those times.

My questions consist of two parts:

1.

I am asking you for detailed answer concerning methods DecodePixels4a and DecodePixels4b. Why method 4b is somewhat slower than the 4a?

If for example it is slower because my code is not aligned correctly then show me which instructions in a given method could be better aligned and how to do this to not break the method.

I would like to see real examples behind the theory. Please bear in mind that I am learning assembly and I want to gain knowledge from your answers which allows me in the future to writing better optimized code.

2.

Can you write faster routine than DecodePixels4a? If so, please present it and describe optimization steps that you have taken.
By faster routine I mean routine that runs in the shortest period of time in your test environment among all the routines presented here.

All Intel family processors are allowed and those which are compatible with them.

Below you will find routines written by me:
procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
  i3: Integer;
begin
  DecPixels[0] := EncPixels and $01;
  for i3 := 1 to 7 do
  begin
    EncPixels := EncPixels shr 1;
    DecPixels[i3] := EncPixels and $01;
    //DecPixels[i3] := (EncPixels shr i3) and $01;  //this is even slower if you replace above 2 lines with it
  end;
end;


//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[1] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[2] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[3] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[4] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[5] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[6] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[7] := EncPixels and $01;
end;


procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    push ecx;
    mov bl, al;
    and bl, $01;
    mov [edx], bl;
    mov ecx, $00;
@@Decode:
    inc ecx;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + ecx], bl;
    cmp ecx, $07;
    jnz @@Decode;
    pop ecx;
    pop ebx;
    pop eax;
  end;
end;


//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    mov  [edx], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $01], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $02], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $03], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $04], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $05], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $06], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;


// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx], bl;        //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $01], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $02], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $03], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $04], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $05], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $06], bl;  //
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;
And here is how do I test them:
program Test;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

type
  TDecodedPixels = array[0..7] of Byte;
var
  Pixels: TDecodedPixels;
  Freq, TimeStart, TimeEnd :Int64;
  Time1, Time2, Time3, Time4a, Time4b: Extended;
  i, i2: Integer;

begin
  if QueryPerformanceFrequency(Freq) then
  begin
    for i2 := 1 to 100 do
    begin
      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels1(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels2(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels3(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4a(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4b(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);

    end;
    Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms.    <- Delphi loop.');
    Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms.    <- Delphi unrolled loop.');
    Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms.    <- BASM loop.');
    Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms.    <- BASM unrolled loop.');
    Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms.    <- BASM unrolled loop instruction switch.');
  end;
  Readln;
end.
Here are the results from my machine ( Intel® Pentium® E2180 on Win32 XP) :
Time1  : 1,68443549919493 ms.     <- Delphi loop.
Time2  : 1,33773024572211 ms.     <- Delphi unrolled loop.
Time3  : 1,37015271374424 ms.     <- BASM loop.
Time4a : 0,822916962526627 ms.    <- BASM unrolled loop.
Time4b : 0,862914462301607 ms.    <- BASM unrolled loop instruction switch.
The results are pretty stable - times vary only by few percent between each test I've made. And that was always true: Time1 > Time3 > Time 2 > Time4b > Time4a 

So I think that de difference between Time4a and Time4b depends of that instructions switch in the method DecodePixels4b. Sometimes it is 4% sometimes it is up to 10% but 4b is always slower than 4a.

I was thinking about another method with usage of MMX instructions to write into memory eight bytes at one time, but I can't figure out fast way to unpack byte into the 64 bit register.

Thank you for your time.



Thank you guys for your valuable input. Whish I could answer all of you at the same time, unfortunately compared to the modern CPU's I have only one "pipe" and can execute only one instruction "reply" at the time ;-)
So, I will try sum up some things over here and write additional comments under your answers.

First of all, I wanted to say that before posting my question I came up with the solution presented by Wouter van Nifterick and it was actually way slower then my assembly code.
So I've decided not to post that routine here, but you may see that I took the same approach also in my loop Delphi version of the routine. It is commented there because it was giving me worser results. 

This is a mystery for me. I've run my code once again with Wouter's and PhilS's routines and here are the results:
Time1  : 1,66535493194387 ms.     <- Delphi loop.
Time2  : 1,29115785420688 ms.     <- Delphi unrolled loop.
Time3  : 1,33716934524107 ms.     <- BASM loop.
Time4a : 0,795041753757838 ms.    <- BASM unrolled loop.
Time4b : 0,843520166815013 ms.    <- BASM unrolled loop instruction switch.
Time5  : 1,49457681191307 ms.     <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,400587402866258 ms.    <- PhiS, table lookup Delphi
Time7  : 0,325472442519827 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,37350491544239 ms.     <- PhiS, table lookup BASM
Look at the Time5 result, quite strange isn't it?
I guess I have different Delphi version, since my generated assembly code differs from that provided by Wouter.

Second major edit:



I know why routine 5 was slower on my machnie. I had checked "Range checking" and "Overflow checking" in my compiler options. I've added assembler directive to routine 9 to see if it helps. It seems that with this directive assembly procedure is as good as Delphi inline variant or even slightly better. 

Here are the final results:
Time1  : 1,22508325749317 ms.     <- Delphi loop.
Time2  : 1,33004145373084 ms.     <- Delphi unrolled loop.
Time3  : 1,1473583622526 ms.      <- BASM loop.
Time4a : 0,77322594033463 ms.     <- BASM unrolled loop.
Time4b : 0,846033593023372 ms.    <- BASM unrolled loop instruction switch.
Time5  : 0,688689382044384 ms.    <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,503233741036693 ms.    <- PhiS, table lookup Delphi
Time7  : 0,385254722925063 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,432993919452751 ms.    <- PhiS, table lookup BASM
Time9  : 0,362680491244212 ms.    <- PhiS, table lookup BASM with assembler directive
Third major edit:



In opinion @Pascal Cuoq and @j_random_hacker the difference in execution times between routines 4a, 4b and 5 is caused by the data dependency. However I have to disagree with that opinion basing on the further tests that I've made.

I've also invented new routine 4c based on 4a. Here it is:
procedure DecodePixels4c(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push ebx;
    mov bl, al;
    and bl, 1;
    mov [edx], bl;
    mov bl, al;
    shr bl, 1;
    and bl, 1;
    mov [edx + $01], bl;
    mov bl, al;
    shr bl, 2;
    and bl, 1;
    mov [edx + $02], bl;
    mov bl, al;
    shr bl, 3;
    and bl, 1;
    mov [edx + $03], bl;
    mov bl, al;
    shr bl, 4;
    and bl, 1;
    mov [edx + $04], bl;
    mov bl, al;
    shr bl, 5;
    and bl, 1;
    mov [edx + $05], bl;
    mov bl, al;
    shr bl, 6;
    and bl, 1;
    mov [edx + $06], bl;
    shr al, 7;
    and al, 1;
    mov [edx + $07], al;
    pop ebx;
  end;
end;
I would say it is pretty data dependent.

And here are the tests and results. I've made four tests to make sure there is no accident.
I've also added new times for the routines proposed by GJ (Time10a, Time10b).
          Test1  Test2  Test3  Test4

Time1   : 1,211  1,210  1,220  1,213
Time2   : 1,280  1,258  1,253  1,332
Time3   : 1,129  1,138  1,130  1,160

Time4a  : 0,690  0,682  0,617  0,635
Time4b  : 0,707  0,698  0,706  0,659
Time4c  : 0,679  0,685  0,626  0,625
Time5   : 0,715  0,682  0,686  0,679

Time6   : 0,490  0,485  0,522  0,514
Time7   : 0,323  0,333  0,336  0,318
Time8   : 0,407  0,403  0,373  0,354
Time9   : 0,352  0,378  0,355  0,355
Time10a : 1,823  1,812  1,807  1,813
Time10b : 1,113  1,120  1,115  1,118
Time10c : 0,652  0,630  0,653  0,633
Time10d : 0,156  0,155  0,172  0,160  <-- current winner!
As you may see the results of 4a, 4b, 4c and 5 are very close to each other.
Why is that? Because I've removed from 4a, 4b (4c already doesn't have it) two instructions: push eax and pop eax. Since I know I wont use anywhere else in my code the value under eax I do not have to prereserve it. 
Now my code has only one pair of push/pop so as the routine 5.
Routine 5 prereserves value of eax beacause it firstly make copy of it under ecx but it deson't prereserve ecx.

So my conclusion is that: the difference in time execution of 5 and 4a and 4b (before the third edit) didn't concern data dependecny but was caused by additional pair of push / pop instructions.

I am very interested in your comments.

After a few days GJ invented even faster routine (Time 10d) than PhiS's. Nice work GJ!
 解决方案 
Your asm code is relativity slow because use stack end write 8 times to memory.
Check this one...
procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;
Maybe is even faster than code with lookup table!

Improved version:
procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;
Version 3:
procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;
Version 4:
const Uint32DecPix : array [0..15] of cardinal = (
  $00000000, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;


                        
这篇关于英特尔x86组装优化技术在示例问题中的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

英特尔x86组装优化技术在示例问题中 [英] Intel x86 assembly optimization techniques in a sample problem

问题描述

2。

1.

2.

相关文章

Delphi最新文章

热门教程

热门工具

登录关闭

英特尔x86组装优化技术在示例问题中 [英] Intel x86 assembly optimization techniques in a sample problem

问题描述

2。

1.

2.

相关文章

Delphi最新文章

热门教程

热门工具

登录 关闭

登录关闭