How to efficiently rotate bitmaps in code


Question


Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp and typically 2048x2400x8bpp

Currently I do this by simply copying with the coordinates swapped, roughly (pseudocode):

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

(In reality I do it with pointers for a bit more speed, but the time is of roughly the same magnitude.)
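For reference, the pseudocode above swaps coordinates, which on its own gives a transpose rather than a rotation. A minimal C sketch of the plain nested-loop baseline for true 90 and 270 degree rotations might look like this. The function names, the clockwise convention and the positive byte pitches are assumptions made for illustration, not the asker's actual routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8bpp image 90 degrees clockwise.
   src is w x h pixels, dst is h x w pixels; pitches are bytes per row
   and are assumed to be positive here. */
void rotate90_naive(const uint8_t *src, size_t w, size_t h, size_t src_pitch,
                    uint8_t *dst, size_t dst_pitch)
{
    assert(src_pitch >= w && dst_pitch >= h);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            /* source pixel (x, y) lands in destination column h-1-y, row x */
            dst[x * dst_pitch + (h - 1 - y)] = src[y * src_pitch + x];
}

/* Rotate 270 degrees clockwise (90 degrees counter-clockwise). */
void rotate270_naive(const uint8_t *src, size_t w, size_t h, size_t src_pitch,
                     uint8_t *dst, size_t dst_pitch)
{
    assert(src_pitch >= w && dst_pitch >= h);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            /* source pixel (x, y) lands in destination column y, row w-1-x */
            dst[(w - 1 - x) * dst_pitch + y] = src[y * src_pitch + x];
}
```

Either way, one of the two images is walked column by column, which is where the cost comes from.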

GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are of the same order of magnitude as the current CPU time.

Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.
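On the in-place wish: for a square image, an in-place 90 degree rotation can be done by cycling four pixels at a time. The following is only a sketch under that assumption (the name `rotate90_inplace_square` is made up here). A non-square image such as 2048x2400 changes shape when rotated, so it would need a considerably more involved cycle-following scheme, and this version does nothing to improve cache behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a square n x n 8bpp image 90 degrees clockwise in place.
   pitch is the number of bytes per row (assumed positive). */
void rotate90_inplace_square(uint8_t *img, size_t n, size_t pitch)
{
    assert(pitch >= n);
    /* Walk the concentric rings; each step moves one 4-pixel cycle. */
    for (size_t y = 0; y < n / 2; y++) {
        for (size_t x = y; x < n - 1 - y; x++) {
            uint8_t *top    = &img[y * pitch + x];
            uint8_t *right  = &img[x * pitch + (n - 1 - y)];
            uint8_t *bottom = &img[(n - 1 - y) * pitch + (n - 1 - x)];
            uint8_t *left   = &img[(n - 1 - x) * pitch + y];
            uint8_t tmp = *top;
            *top    = *left;    /* left edge moves to the top     */
            *left   = *bottom;  /* bottom edge moves to the left  */
            *bottom = *right;   /* right edge moves to the bottom */
            *right  = tmp;      /* old top moves to the right     */
        }
    }
}
```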

Target is Delphi, but this is more of an algorithmic question. SSE(2) vectorization is no problem; the problem is big enough for me to code it in assembler.


Follow up to Nils' answer

  • Image 2048x2700 -> 2700x2048
  • Compiler Turbo Explorer 2006 with optimization on.
  • Windows: Power scheme set to "Always on". (important!!!!)
  • Machine: Core2 6600 (2.4 GHz)

time with old routine: 32ms (step 1)

time with stepsize 8 : 12ms

time with stepsize 16 : 10ms

time with stepsize 32+ : 9ms

Meanwhile I also tested on an Athlon 64 X2 (5200+ iirc), and the speed-up there was slightly more than a factor of four (80 to 19 ms).

The speed-up is well worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I have already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straightforward implementation:

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>
store r2, <target+1*rowsize>
..
store r8, <target+7*rowsize>

So 8x8 needs 9 registers, but 32-bit SSE only has 8. Anyway, that is something for the summer months :-)

Note that using pointers is something I do out of instinct, but there may actually be something to it: if the dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls in themselves are cheap nowadays, they also generate more register pressure, afaik.

The code (validated by subtracting the result from that of the "naive" rotate1 implementation):

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;
  rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

Update 2 Generics

I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it :-)

Update 3: Generics now work in XE10

Update 4

In 2017 I did some work on an assembler version, for 8x8 blocks of 8bpp images only, and asked a related SO question about shuffle bottlenecks, where Peter Cordes generously helped me out. This code still has a missed opportunity: it needs another level of loop tiling to aggregate multiple 8x8 block iterations into larger pseudo-blocks such as 64x64. At the moment it processes whole lines again, and that is wasteful.

Answer

Yes, there are faster ways to do this.

Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That's a stride at which the cache performs worst.
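To make the power-of-two point concrete, here is a small C sketch that counts how many cache sets a column walk can use. It models a hypothetical set-associative cache in a deliberately simplified way; the geometry in the usage note (64-byte lines, 64 sets, as in a 32 KiB 8-way L1 data cache) is an assumption for illustration, not a measurement of any particular machine.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache sets touched when reading `steps` addresses
   that are `stride` bytes apart, starting at address 0.
   line_size and num_sets describe a hypothetical cache (num_sets <= 1024). */
size_t distinct_sets(size_t stride, size_t steps,
                     size_t line_size, size_t num_sets)
{
    unsigned char seen[1024] = {0};
    size_t count = 0;
    assert(line_size > 0 && num_sets > 0 && num_sets <= 1024);
    for (size_t k = 0; k < steps; k++) {
        /* simplified mapping: set index = cache line number modulo set count */
        size_t set = (k * stride / line_size) % num_sets;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

With these assumed numbers, `distinct_sets(2048, 2048, 64, 64)` yields 2, so walking down a column of a 2048-byte-wide image competes for only two sets (16 lines in an 8-way cache), while a stride of 2112 bytes spreads over all 64 sets.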

You can improve this rotation algorithm if you improve the locality of your memory accesses.

A simple way to do this would be to rotate each 8x8 pixel block on its own, using the same code you've used for your whole bitmap, and wrap it in another loop that splits the image rotation into chunks of 8x8 pixels each.

E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren't up to date):

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; block_y < 2048; block_y+=8)
    { 
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x];
    }
 } 
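The snippet above assumes a fixed 2048x2048 size and, like the question's pseudocode, swaps coordinates (a transpose). A hedged sketch of the same tiling idea for a true 90 degree clockwise rotation, with arbitrary dimensions, row pitches and block size, could look as follows. The name `rotate90_blocked` and its parameter list are assumptions made for this example; partial blocks at the right and bottom edges are handled by clamping the block bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8bpp image 90 degrees clockwise, one bs x bs tile at a time.
   src is w x h, dst is h x w; pitches are bytes per row (assumed positive). */
void rotate90_blocked(const uint8_t *src, size_t w, size_t h, size_t src_pitch,
                      uint8_t *dst, size_t dst_pitch, size_t bs)
{
    assert(bs > 0 && src_pitch >= w && dst_pitch >= h);
    for (size_t by = 0; by < h; by += bs) {
        size_t y_end = (by + bs < h) ? by + bs : h;      /* clamp last tile */
        for (size_t bx = 0; bx < w; bx += bs) {
            size_t x_end = (bx + bs < w) ? bx + bs : w;  /* clamp last tile */
            for (size_t y = by; y < y_end; y++) {
                const uint8_t *s = src + y * src_pitch;  /* source row y */
                uint8_t *d = dst + (h - 1 - y);          /* destination column */
                for (size_t x = bx; x < x_end; x++)
                    d[x * dst_pitch] = s[x];
            }
        }
    }
}
```

For a 2048x2700 source this might be called as `rotate90_blocked(src, 2048, 2700, 2048, dst, 2700, 16)`, with the block size chosen by measurement.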

There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.
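As a rough illustration of Morton order (also called Z-order), the block coordinates can be recovered from a running index by de-interleaving its bits. The sketch below uses the standard bit-compaction trick; the function names are made up for this example, and whether this ordering actually beats simple tiling would have to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Gather the even-numbered bits of v into the low 16 bits of the result. */
static uint32_t compact_even_bits(uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

/* Convert a Morton index into block coordinates:
   even bits of the index form bx, odd bits form by. */
void morton_decode(uint32_t index, uint32_t *bx, uint32_t *by)
{
    assert(bx != 0 && by != 0);
    *bx = compact_even_bits(index);
    *by = compact_even_bits(index >> 1);
}
```

A driver loop would run `index` from 0 to the number of blocks minus one, decode it, and process the 8x8 block at `(bx*8, by*8)`. This works directly only when the block grid is square with a power-of-two side, such as 256x256 blocks for a 2048x2048 image.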

Btw, since you've mentioned that SSE is an option for you: note that you can rotate an 8x8 byte block within the SSE registers. It's a bit tricky to get working, but looking at SSE matrix transpose code should get you started, as it's the same thing.
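One possible shape of such an in-register transpose, written with SSE2 intrinsics from `<emmintrin.h>` rather than assembler, is sketched below. It relies on three rounds of unpack (interleave) instructions instead of byte shifts; the function name is an assumption, and this is not the asker's later assembler version. Because each interleave result can overwrite one of its inputs, it should be possible to stay within eight XMM registers, though the actual allocation is up to the compiler.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Transpose one 8x8 block of bytes: dst[c][r] = src[r][c].
   Pitches are in bytes; no particular alignment is required. */
void transpose8x8_sse2(const uint8_t *src, size_t src_pitch,
                       uint8_t *dst, size_t dst_pitch)
{
    assert(src != NULL && dst != NULL);

    /* Load eight 8-byte rows into the low halves of eight registers. */
    __m128i r0 = _mm_loadl_epi64((const __m128i *)(src + 0 * src_pitch));
    __m128i r1 = _mm_loadl_epi64((const __m128i *)(src + 1 * src_pitch));
    __m128i r2 = _mm_loadl_epi64((const __m128i *)(src + 2 * src_pitch));
    __m128i r3 = _mm_loadl_epi64((const __m128i *)(src + 3 * src_pitch));
    __m128i r4 = _mm_loadl_epi64((const __m128i *)(src + 4 * src_pitch));
    __m128i r5 = _mm_loadl_epi64((const __m128i *)(src + 5 * src_pitch));
    __m128i r6 = _mm_loadl_epi64((const __m128i *)(src + 6 * src_pitch));
    __m128i r7 = _mm_loadl_epi64((const __m128i *)(src + 7 * src_pitch));

    /* Stage 1: interleave the bytes of neighbouring rows. */
    __m128i t0 = _mm_unpacklo_epi8(r0, r1);
    __m128i t1 = _mm_unpacklo_epi8(r2, r3);
    __m128i t2 = _mm_unpacklo_epi8(r4, r5);
    __m128i t3 = _mm_unpacklo_epi8(r6, r7);

    /* Stage 2: interleave 16-bit pairs. */
    __m128i u0 = _mm_unpacklo_epi16(t0, t1);
    __m128i u1 = _mm_unpackhi_epi16(t0, t1);
    __m128i u2 = _mm_unpacklo_epi16(t2, t3);
    __m128i u3 = _mm_unpackhi_epi16(t2, t3);

    /* Stage 3: interleave 32-bit groups; each result holds two output rows. */
    __m128i v0 = _mm_unpacklo_epi32(u0, u2);
    __m128i v1 = _mm_unpackhi_epi32(u0, u2);
    __m128i v2 = _mm_unpacklo_epi32(u1, u3);
    __m128i v3 = _mm_unpackhi_epi32(u1, u3);

    /* Store the low half, then the high half, of each result. */
    _mm_storel_epi64((__m128i *)(dst + 0 * dst_pitch), v0);
    _mm_storel_epi64((__m128i *)(dst + 1 * dst_pitch), _mm_srli_si128(v0, 8));
    _mm_storel_epi64((__m128i *)(dst + 2 * dst_pitch), v1);
    _mm_storel_epi64((__m128i *)(dst + 3 * dst_pitch), _mm_srli_si128(v1, 8));
    _mm_storel_epi64((__m128i *)(dst + 4 * dst_pitch), v2);
    _mm_storel_epi64((__m128i *)(dst + 5 * dst_pitch), _mm_srli_si128(v2, 8));
    _mm_storel_epi64((__m128i *)(dst + 6 * dst_pitch), v3);
    _mm_storel_epi64((__m128i *)(dst + 7 * dst_pitch), _mm_srli_si128(v3, 8));
}
```

This is a transpose, not yet a rotation. Loading the eight source rows in reverse order (bottom row into `r0`) should turn the same sequence into a 90 degree clockwise rotation of the block.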


EDIT:

Just checked:

With a block size of 8x8 pixels the code runs about 5 times faster on my machine. With a block size of 16x16 it runs about 10 times faster.

Seems like it's a good idea to experiment with different block-sizes.
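One way to run that experiment is to make the block size a run-time parameter and time each candidate. The sketch below is a portable variation on the idea, using the standard `clock()` instead of a Windows timer; the names `transpose_blocked` and `time_block_size` and the fixed 2048x2048 buffers are assumptions for illustration. Like the test program, it times the coordinate swap (a transpose), which has the same memory access pattern as the rotation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define DIM 2048

static unsigned char in_img[DIM * DIM];
static unsigned char out_img[DIM * DIM];

/* Blocked transpose with a run-time block size bs, which must divide DIM.
   bs == 1 degenerates to the plain nested loop. */
void transpose_blocked(size_t bs)
{
    assert(bs > 0 && DIM % bs == 0);
    for (size_t by = 0; by < DIM; by += bs)
        for (size_t bx = 0; bx < DIM; bx += bs)
            for (size_t y = by; y < by + bs; y++)
                for (size_t x = bx; x < bx + bs; x++)
                    out_img[DIM * y + x] = in_img[DIM * x + y];
}

/* Average CPU time in milliseconds per call, measured over `runs` calls. */
double time_block_size(size_t bs, int runs)
{
    assert(runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        transpose_blocked(bs);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC / runs;
}
```

Calling `time_block_size(bs, 20)` for `bs` in 1, 8, 16, 32 and 64 and printing the results gives a quick picture of where the curve flattens on a given machine. The resolution of `clock()` varies between platforms, so more runs may be needed for stable numbers.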

Here is the (very simple) test-program I've used:

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


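int main (void)
{
  int i;
  DWORD t1;   /* GetTickCount() returns milliseconds as an unsigned DWORD */

  /* each variant is timed over 20 runs */
  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%lu\n", (unsigned long)(GetTickCount()-t1));

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%lu\n", (unsigned long)(GetTickCount()-t1));

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%lu\n", (unsigned long)(GetTickCount()-t1));

  return 0;
}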
