Rotating bitmaps in code


Question




Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

The bitmaps are 8bpp, typically 2048 x 2400 pixels.

Currently I do this by simply copying with the arguments swapped, roughly like this (pseudocode):

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)
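
Strictly speaking, the pseudocode above is a transpose; a true 90 or 270 degree rotation also reverses one axis. A minimal C sketch of both index mappings follows. It assumes tightly packed 8bpp buffers (row pitch equal to width), and the function names are hypothetical, not taken from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a w x h 8bpp image 90 degrees clockwise into dst (h wide, w high).
   Assumes both buffers are tightly packed; that is an assumption of this
   sketch, not something the question states. */
void rotate90_naive(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            /* source pixel (x, y) lands in target row x, column h-1-y */
            dst[x * h + (h - 1 - y)] = src[y * w + x];
}

/* Rotate 270 degrees clockwise (90 degrees counter-clockwise). */
void rotate270_naive(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            /* source pixel (x, y) lands in target row w-1-x, column y */
            dst[(w - 1 - x) * h + y] = src[y * w + x];
}
```

Both functions read the source row by row and write the target column by column, which is the access pattern the question is asking how to speed up.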

GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are of the same order of magnitude as the current CPU time.

Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.

Target is Delphi, but it is more an algorithmic question. SSE(2) vectorization is no problem in principle; coding it in assembler is a big enough problem for me.


Follow up to Nils' answer

  • Image 2048x2700 -> 2700x2048
  • Compiler Turbo Explorer 2006 with optimization on.
  • Windows: Power scheme set to "Always on". (important!!!!)
  • Machine: Core2 6600 (2.4 GHz)

time with old routine: 32ms (step 1)

time with stepsize 8 : 12ms

time with stepsize 16 : 10ms

time with stepsize 32+ : 9ms

Meanwhile I also tested on an Athlon 64 X2 (5200+ iirc), and the speed-up there was slightly more than a factor of four (80 to 19 ms).

The speed-up is well worth it, thanks. Maybe during the summer months I'll torture myself with an SSE(2) version. However, I have already thought about how to tackle that, and I think I'll run out of SSE2 registers for a straightforward implementation:

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>
store r2, <target+1*rowsize>
..
store r8, <target+7*rowsize>

So 8x8 needs 9 registers, but 32-bit SSE only has 8. Anyway, that is something for the summer months :-)

Note that the pointer thing is something I do out of instinct, but there may actually be something to it: if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls as such are cheap nowadays, they also generate more register pressure, afaik.

The code (validated by subtracting the result from the "naive" rotate1 implementation):
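
To illustrate the pointer-stepping idea in isolation, here is a small C sketch. It assumes row pitches are only known at run time and may include alignment padding; the function name and parameters are hypothetical. The write pointer advances by one addition per pixel instead of recomputing row * pitch + column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a w x h 8bpp image 90 degrees clockwise. src_pitch and dst_pitch
   are the byte distances between consecutive rows of each buffer. */
void rotate90_pitched(const uint8_t *src, ptrdiff_t src_pitch,
                      uint8_t *dst, ptrdiff_t dst_pitch,
                      size_t w, size_t h)
{
    assert(src != NULL && dst != NULL);
    if (w == 0)
        return;
    for (size_t y = 0; y < h; y++) {
        const uint8_t *s = src + (ptrdiff_t)y * src_pitch; /* start of source row y */
        uint8_t *d = dst + (h - 1 - y);                    /* top of target column h-1-y */
        d[0] = s[0];
        for (size_t x = 1; x < w; x++) {
            d += dst_pitch;   /* step down one target row: an add, no multiply */
            *d = s[x];
        }
    }
}
```

Whether this actually beats the indexed form depends on the compiler, which may already apply the same strength reduction on its own.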

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;        rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

Update 2 Generics

I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it :-)

Update 3 Generics

This now works in XE10.

Solution

Yes, there are faster ways to do this.

Your simple loop spends most of its time in cache misses. This happens because you touch a lot of data at very different places in a tight loop. Even worse: your memory locations are exactly a power of two apart. That's a stride at which the cache performs worst.
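
The power-of-two effect can be made concrete with a little arithmetic. The sketch below counts how many distinct cache sets a column walk touches. The cache geometry (64-byte lines, 64 sets) is an assumption chosen for illustration, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many distinct cache sets are touched when accessing `count`
   addresses spaced `stride` bytes apart, starting at address 0.
   line_size and num_sets describe a hypothetical set-associative cache;
   this sketch supports at most 1024 sets. */
size_t distinct_sets(size_t stride, size_t count,
                     size_t line_size, size_t num_sets)
{
    unsigned char seen[1024] = {0};
    size_t distinct = 0;

    assert(line_size > 0 && num_sets > 0 && num_sets <= 1024);
    for (size_t i = 0; i < count; i++) {
        size_t set = (i * stride / line_size) % num_sets;
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With these assumed parameters, distinct_sets(2048, 64, 64, 64) gives 2, so 64 consecutive column accesses compete for just two sets, while a row pitch padded by one cache line, distinct_sets(2112, 64, 64, 64), spreads them over all 64 sets.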

You can improve this rotation algorithm if you improve the locality of your memory accesses.

A simple way to do this would be to rotate each 8x8 pixel block on its own using the same code you've used for your whole bitmap, and wrap it in another loop that splits the image rotation into chunks of 8x8 pixels each.

E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren't up to date):

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; block_y < 2048; block_y+=8)
    {
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x];
    }
 }
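
The loop above assumes a 2048 x 2048 image whose sides are multiples of 8. As a hedged generalisation, the following sketch takes the block size as a parameter and clamps the inner loop bounds so that partial blocks at the right and bottom edges are handled too. It assumes tightly packed 8bpp buffers and performs a true 90 degree clockwise rotation rather than a transpose; the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked 90-degree clockwise rotation of a tightly packed w x h 8bpp
   image into dst (h wide, w high). bs is the block edge length in pixels. */
void rotate90_blocked(const uint8_t *src, uint8_t *dst,
                      size_t w, size_t h, size_t bs)
{
    assert(src != NULL && dst != NULL && bs > 0);
    for (size_t by = 0; by < h; by += bs) {
        size_t y_end = min_size(by + bs, h);      /* clamp bottom edge */
        for (size_t bx = 0; bx < w; bx += bs) {
            size_t x_end = min_size(bx + bs, w);  /* clamp right edge */
            for (size_t y = by; y < y_end; y++)
                for (size_t x = bx; x < x_end; x++)
                    dst[x * h + (h - 1 - y)] = src[y * w + x];
        }
    }
}
```

Clamping keeps the code short at the cost of two extra comparisons per block; handling the edge strips in separate loops avoids those but needs more code.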

There are other ways as well. You could process the data in Hilbert order or Morton order. That would in theory be even a bit faster, but the code will be much more complex.

Btw - since you've mentioned that SSE is an option for you: note that you can rotate an 8x8 byte block within the SSE registers. It's a bit tricky to get working, but looking at SSE matrix transpose code should get you started, as it's the same thing.
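
As a rough illustration of the Morton-order idea, the sketch below visits 8x8 blocks in Z-order by de-interleaving the bits of a running index. It assumes a square, tightly packed image whose side n is a power of two, at least 8 and at most 65536; like the loop above it only transposes. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather the even-numbered bits of v into the low 16 bits of the result. */
static uint32_t compact_bits(uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

/* Decode a Morton (Z-order) index: x from even bits, y from odd bits. */
void morton_decode(uint32_t m, uint32_t *x, uint32_t *y)
{
    *x = compact_bits(m);
    *y = compact_bits(m >> 1);
}

/* Transpose an n x n 8bpp image, visiting its 8x8 blocks in Morton order. */
void transpose_morton_blocks(const uint8_t *src, uint8_t *dst, uint32_t n)
{
    uint32_t nb = n / 8;  /* blocks per side, must be a power of two */

    assert(n >= 8 && n <= 65536 && (n & (n - 1)) == 0);
    for (uint32_t m = 0; m < nb * nb; m++) {
        uint32_t bx, by;
        morton_decode(m, &bx, &by);
        bx *= 8;
        by *= 8;
        for (uint32_t y = 0; y < 8; y++)
            for (uint32_t x = 0; x < 8; x++)
                dst[(size_t)(x + bx) * n + (y + by)] =
                    src[(size_t)(y + by) * n + (x + bx)];
    }
}
```

Consecutive Morton indices stay close together in both x and y, so source and target blocks tend to remain in nearby memory. Whether that beats plain row-major block order would have to be measured.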
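
A possible shape for the in-register transpose is sketched below using SSE2 intrinsics from emmintrin.h. It interleaves rows pairwise at 8-bit, then 16-bit, then 32-bit granularity, so no separate accumulator per output row is needed. The function name, the HAVE_SSE2 macro and the scalar fallback are assumptions of this sketch, and it transposes rather than rotates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1   /* hypothetical macro used only by this sketch */
#endif

/* Transpose one 8x8 block of bytes: dst[k][j] = src[j][k]. */
void transpose8x8(const uint8_t *src, ptrdiff_t src_pitch,
                  uint8_t *dst, ptrdiff_t dst_pitch)
{
    assert(src != NULL && dst != NULL);
#ifdef HAVE_SSE2
    /* Load each 8-byte row into the low half of a register. */
    __m128i r0 = _mm_loadl_epi64((const __m128i *)(src + 0 * src_pitch));
    __m128i r1 = _mm_loadl_epi64((const __m128i *)(src + 1 * src_pitch));
    __m128i r2 = _mm_loadl_epi64((const __m128i *)(src + 2 * src_pitch));
    __m128i r3 = _mm_loadl_epi64((const __m128i *)(src + 3 * src_pitch));
    __m128i r4 = _mm_loadl_epi64((const __m128i *)(src + 4 * src_pitch));
    __m128i r5 = _mm_loadl_epi64((const __m128i *)(src + 5 * src_pitch));
    __m128i r6 = _mm_loadl_epi64((const __m128i *)(src + 6 * src_pitch));
    __m128i r7 = _mm_loadl_epi64((const __m128i *)(src + 7 * src_pitch));

    /* Interleave bytes of row pairs, then 16-bit pairs, then 32-bit pairs. */
    __m128i t0 = _mm_unpacklo_epi8(r0, r1);
    __m128i t1 = _mm_unpacklo_epi8(r2, r3);
    __m128i t2 = _mm_unpacklo_epi8(r4, r5);
    __m128i t3 = _mm_unpacklo_epi8(r6, r7);
    __m128i u0 = _mm_unpacklo_epi16(t0, t1);
    __m128i u1 = _mm_unpackhi_epi16(t0, t1);
    __m128i u2 = _mm_unpacklo_epi16(t2, t3);
    __m128i u3 = _mm_unpackhi_epi16(t2, t3);
    __m128i v0 = _mm_unpacklo_epi32(u0, u2);  /* source columns 0 and 1 */
    __m128i v1 = _mm_unpackhi_epi32(u0, u2);  /* source columns 2 and 3 */
    __m128i v2 = _mm_unpacklo_epi32(u1, u3);  /* source columns 4 and 5 */
    __m128i v3 = _mm_unpackhi_epi32(u1, u3);  /* source columns 6 and 7 */

    /* Each register holds two output rows: low half, then high half. */
    _mm_storel_epi64((__m128i *)(dst + 0 * dst_pitch), v0);
    _mm_storel_epi64((__m128i *)(dst + 1 * dst_pitch), _mm_srli_si128(v0, 8));
    _mm_storel_epi64((__m128i *)(dst + 2 * dst_pitch), v1);
    _mm_storel_epi64((__m128i *)(dst + 3 * dst_pitch), _mm_srli_si128(v1, 8));
    _mm_storel_epi64((__m128i *)(dst + 4 * dst_pitch), v2);
    _mm_storel_epi64((__m128i *)(dst + 5 * dst_pitch), _mm_srli_si128(v2, 8));
    _mm_storel_epi64((__m128i *)(dst + 6 * dst_pitch), v3);
    _mm_storel_epi64((__m128i *)(dst + 7 * dst_pitch), _mm_srli_si128(v3, 8));
#else
    /* Scalar fallback for targets without SSE2. */
    for (ptrdiff_t y = 0; y < 8; y++)
        for (ptrdiff_t x = 0; x < 8; x++)
            dst[x * dst_pitch + y] = src[y * src_pitch + x];
#endif
}
```

To turn the transpose into a 90 degree clockwise rotation, the source rows could be loaded in reverse order (row 7 into r0, and so on). How many registers end up live at once is the compiler's decision, so this does not by itself settle the eight-register concern raised in the question.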


EDIT:

Just checked:

With a block size of 8x8 pixels the code runs about 5 times faster on my machine. With a block size of 16x16 it runs 10 times faster.

Seems like it's a good idea to experiment with different block-sizes.

Here is the (very simple) test-program I've used:

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


int main (int argc, char **args)
{
  int i, t1;

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%d\n", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%d\n", GetTickCount()-t1);

}
