如何在代码中有效地旋转位图 [英] How to efficiently rotate bitmaps in code

查看:29
本文介绍了如何在代码中有效地旋转位图的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否有比简单地使用倒置坐标执行嵌套循环更快地将 位图旋转 90 或 270 度的方法?

Is there a faster way to rotate a large bitmap by 90 or 270 degrees than simply doing a nested loop with inverted coordinates?

位图为 8bpp,通常为 2048x2400x8bpp

The bitmaps are 8bpp and typically 2048x2400x8bpp

目前我通过简单地复制参数反转来做到这一点,大致(伪代码:

Currently I do this by simply copying with argument inversion, roughly (pseudo code:

for x = 0 to 2048-1
  for y = 0 to 2048-1
    dest[x][y]=src[y][x];

(实际上我是用指针来做的,速度要快一点,但大致相同)

(In reality I do it with pointers, for a bit more speed, but that is roughly the same magnitude)

GDI 在处理大图像时非常慢,纹理(GF7 卡)的 GPU 加载/存储时间与当前 CPU 时间相同.

GDI is quite slow with large images, and GPU load/store times for textures (GF7 cards) are in the same magnitude as the current CPU time.

任何提示,指针?就地算法会更好,但速度比就地算法更重要.

Any tips, pointers? An in-place algorithm would even be better, but speed is more important than being in-place.

Target 是 Delphi,但它更像是一个算法问题.SSE(2) 向量化没问题,对我来说用汇编代码编写它是一个足够大的问题

Target is Delp but it is more an algorithmic question. SSE(2) vectorization no problem, it is a big enough problem for me to code it in assembler

跟进尼尔斯的回答

  • 图像 2048x2700 ->2700x2048
  • 已启用优化的编译器 Turbo Explorer 2006.
  • Windows:电源方案设置为始终开启".(重要!!!!)
  • 机器:Core2 6600 (2.4 GHz)

旧例程的时间:32ms(步骤 1)

步长为 8 的时间:12ms

步长为 16 的时间:10ms

步长 32+ 的时间:9ms

time with old routine: 32ms (step 1)

time with stepsize 8 : 12ms

time with stepsize 16 : 10ms

time with stepsize 32+ : 9ms

与此同时,我还在 Athlon 64 X2 (5200+ iirc) 上进行了测试,速度提高了四倍多(80 到 19 毫秒).

Meanwhile I also tested on a Athlon 64 X2 (5200+ iirc), and the speed up there was slightly more than a factor four (80 to 19 ms).

加速是值得的,谢谢.也许在夏季的几个月里,我会用 SSE(2) 版本来折磨自己.但是我已经考虑过如何解决这个问题,我想我会用完 SSE2 寄存器来直接实现:

The speed up is well worth it, thanks. Maybe that during the summer months I'll torture myself with a SSE(2) version. However I already thought about how to tackle that, and I think I'll run out of SSE2 registers for an straight implementation:

for n:=0 to 7 do
  begin
    load r0, <source+n*rowsize> 
    shift byte from r0 into r1
    shift byte from r0 into r2
    ..
    shift byte from r0 into r8
  end; 
store r1, <target>   
store r2, <target+1*<rowsize>
..
store r8, <target+7*<rowsize>   

所以 8x8 需要 9 个寄存器,但 32 位 SSE 只有 8 个.无论如何这是夏季月份的事情:-)

So 8x8 needs 9 registers, but 32-bits SSE only has 8. Anyway that is something for the summer months :-)

请注意,指针的事情是我出于本能而做的,但实际上可能有一些东西,如果您的维度不是硬编码的,编译器无法将 mul 转换为移位.虽然现在 muls an sich 很便宜,但它们也会产生更大的注册压力 afaik.

Note that the pointer thing is something that I do out of instinct, but it could be there is actually something to it, if your dimensions are not hardcoded, the compiler can't turn the mul into a shift. While muls an sich are cheap nowadays, they also generate more register pressure afaik.

代码(通过从naieve"rotate1 实现中减去结果来验证):

The code (validated by subtracting result from the "naieve" rotate1 implementation):

const stepsize = 32;
procedure rotatealign(Source: tbw8image; Target:tbw8image);

var stepsx,stepsy,restx,resty : Integer;
   RowPitchSource, RowPitchTarget : Integer;
   pSource, pTarget,ps1,ps2 : pchar;
   x,y,i,j: integer;
   rpstep : integer;
begin
  RowPitchSource := source.RowPitch;          // bytes to jump to next line. Can be negative (includes alignment)
  RowPitchTarget := target.RowPitch;        rpstep:=RowPitchTarget*stepsize;
  stepsx:=source.ImageWidth div stepsize;
  stepsy:=source.ImageHeight div stepsize;
  // check if mod 16=0 here for both dimensions, if so -> SSE2.
  for y := 0 to stepsy - 1 do
    begin
      psource:=source.GetImagePointer(0,y*stepsize);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(target.imagewidth-(y+1)*stepsize,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
          inc(psource,stepsize);
          inc(ptarget,rpstep);
        end;
    end;
  // 3 more areas to do, with dimensions
  // - stepsy*stepsize * restx        // right most column of restx width
  // - stepsx*stepsize * resty        // bottom row with resty height
  // - restx*resty                    // bottom-right rectangle.
  restx:=source.ImageWidth mod stepsize;   // typically zero because width is 
                                          // typically 1024 or 2048
  resty:=source.Imageheight mod stepsize;
  if restx>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(source.ImageWidth-restx,0);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(Target.imagewidth-stepsize,Target.imageheight-restx);
      for y := 0 to stepsy - 1 do
        begin
          for i := 0 to stepsize - 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[stepsize-1-i];       //  (maxx-i,0);
              for j := 0 to restx - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize*RowPitchSource);
         dec(ptarget,stepsize);
       end;
    end;
  if resty>0 then
    begin
      // one loop less, since we know this fits in one line of  "blocks"
      psource:=source.GetImagePointer(0,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,0);
      for x := 0 to stepsx - 1 do
        begin
          for i := 0 to resty- 1 do
            begin
              ps1:=@psource[rowpitchsource*i];   // ( 0,i)
              ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
              for j := 0 to stepsize - 1 do
               begin
                 ps2[0]:=ps1[j];
                 inc(ps2,RowPitchTarget);
               end;
            end;
         inc(psource,stepsize);
         inc(ptarget,rpstep);
       end;
    end;
 if (resty>0) and (restx>0) then
    begin
      // another loop less, since only one block
      psource:=source.GetImagePointer(source.ImageWidth-restx,source.ImageHeight-resty);    // gets pointer to pixel x,y
      ptarget:=Target.GetImagePointer(0,target.ImageHeight-restx);
      for i := 0 to resty- 1 do
        begin
          ps1:=@psource[rowpitchsource*i];   // ( 0,i)
          ps2:=@ptarget[resty-1-i];       //  (maxx-i,0);
          for j := 0 to restx - 1 do
            begin
              ps2[0]:=ps1[j];
              inc(ps2,RowPitchTarget);
            end;
       end;
    end;
end;

更新 2 个泛型

我尝试将此代码更新为 Delphi XE 中的泛型版本.由于QC 99703我失败了,论坛人已经确认它也存在于XE2中.请投票给它:-)

I tried to update this code to a generics version in Delphi XE. I failed because of QC 99703, and forum people have already confirmed it also exists in XE2. Please vote for it :-)

更新 3 个泛型现在在 XE10 中工作

Update 3 Generics Works now in XE10

更新 4

在 2017 年,我为 8x8 立方体的 8bpp 图像做了一些工作 和相关的 SO 问题 关于 shuffle 瓶颈,Peter Cordes 慷慨地帮助了我.这段代码仍然错过了一个机会,仍然需要另一个循环平铺级别来将多个 8x8 块迭代聚合成伪更大的块迭代,如 64x64.现在又是整行,那是浪费.

In 2017 i did some work on a assembler version for 8x8 cubes of 8bpp images only and related SO question about shuffle bottlenecks where Peter Cordes generously helped me out. This code still has a missed oportunity and still needs another looptiling level again to aggregate multiple 8x8 block iterations into pseudo larger ones like 64x64. Now it is whole lines again and that is wasteful.

推荐答案

是的,有更快的方法来做到这一点.

Yes, there are faster ways to do this.

您的简单循环大部分时间都花在缓存未命中上.发生这种情况是因为您在紧密循环中在非常不同的地方接触了大量数据.更糟糕的是:您的内存位置恰好是 2 的幂.这是缓存性能最差的大小.

Your simple loop spends most of the time in cache misses. This happends because you touch a lot of data at very different places in a tight loop. Even worse: Your memory locations are exactly a power of two apart. That's a size where the cache performs worst.

如果您改进内存访问的局部性,您可以改进此轮换算法.

You can improve this rotation algorithm if you improve the locality of your memory accesses.

一种简单的方法是使用您用于整个位图的相同代码自行旋转每个 8x8 像素块,并包装另一个循环,将图像旋转分成每个 8x8 像素的块.

A simple way to do this would be to rotate each 8x8 pixel block on it's own using the same code you've used for your whole bitmap, and wrap another loop that splits the image rotation into chunks of 8x8 pixels each.

例如像这样的东西(没有检查,对不起 C 代码.我的 Delphi 技能不是最新的):

E.g. something like this (not checked, and sorry for the C-code. My Delphi skills aren't up to date):

 // this is the outer-loop that breaks your image rotation
 // into chunks of 8x8 pixels each:
 for (int block_x = 0; block_x < 2048; block_x+=8)
 {
    for (int block_y = 0; blocky_y < 2048; block_y+=8)
    { 
       // this is the inner-loop that processes a block
       // of 8x8 pixels.
       for (int x= 0; x<8; x++)
         for (int y=0; y<8; y++)
            dest[x+block_x][y+block_y] = src[y+block_y][x+block_x]
    }
 } 

还有其他方法.您可以按 Hilbert-Order 或 Morton-Order 处理数据.这在理论上甚至会更快一点,但代码会复杂得多.

There are other ways as well. You could process the data in Hilbert-Order or Morton-Order. That would be in theory even a bit faster, but the code will be much more complex.

顺便说一句 - 由于您提到 SSE 是您的一个选择.请注意,您可以在 SSE 寄存器内旋转 8x8 字节块.让它工作有点棘手,但查看 SSE 矩阵转置代码应该会让你开始,因为它是一样的.

Btw - Since you've mentioned that SSE is an option for you. Note that you can rotate a 8x8 byte block within the SSE-registers. It's a bit tricky to get it working, but looking at SSE matrix transpose code should get you started as it's the same thing.

刚刚检查:

代码块大小为 8x8 像素,运行时间约为.在我的机器上快了 5 倍.块大小为 16x16 时,它的运行速度提高了 10 倍.

With a block-size of 8x8 pixels the code runs ca. 5 times faster on my machine. With a block-size of 16x16 it runs 10 times faster.

似乎尝试不同的块大小是个好主意.

Seems like it's a good idea to experiment with different block-sizes.

这是我使用过的(非常简单的)测试程序:

Here is the (very simple) test-program I've used:

#include <stdio.h>
#include <windows.h>

char temp1[2048*2048];
char temp2[2048*2048];

void rotate1 (void)
{
  int x,y;
  for (y=0; y<2048; y++)
  for (x=0; x<2048; x++)
    temp2[2048*y+x] = temp1[2048*x+y];
}

void rotate2 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=8)
  for (bx=0; bx<2048; bx+=8)
  for (y=0; y<8; y++)
  for (x=0; x<8; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}

void rotate3 (void)
{
  int x,y;
  int bx, by;

  for (by=0; by<2048; by+=16)
  for (bx=0; bx<2048; bx+=16)
  for (y=0; y<16; y++)
  for (x=0; x<16; x++)
    temp2[2048*(y+by)+x+bx] = temp1[2048*(x+bx)+y+by];
}


int main (int argc, char **args)
{
  int i, t1;

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate1();
  printf ("%d
", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate2();
  printf ("%d
", GetTickCount()-t1);

  t1 = GetTickCount();
  for (i=0; i<20; i++) rotate3();
  printf ("%d
", GetTickCount()-t1);

}

这篇关于如何在代码中有效地旋转位图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆