Fastest way to transform and transpose a byte[] of 32-bit image data into a 24-bit image data array?


Question

I would like to know whether anyone can think of a better solution for this.

I have a byte array of raw bitmap data in 32-bit format (RGBA).

I need to convert this array to a 24-bit format by dropping the alpha component.

Additionally, I need to transpose the data: the red components go in the first third of the result array, the green components in the second third, and the blue components in the last third, because I run RLE compression on the result afterwards.

So instead of an array that looks like R,G,B,R,G,B,R..., I will have an array that looks like R,R,R,R...G,G,G,G...B,B,B,B.

With this layout I get better RLE compression, because neighbouring pixels are more likely to share the same value within a single component.

I'm using unsafe code and fixed pointers to do that. It takes about 10 ms (~31,000 ticks) to process a byte array of about 8.3 million bytes. That may seem fast, but I'm doing real-time screen grabbing and every millisecond counts...
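When optimizing a conversion like this, it can help to pin down the target layout with a slow but obviously correct version to compare against. The following is only a minimal sketch: the class and method names (`PlanarReference`, `ToPlanarRgb`) are hypothetical, and it assumes the source bytes are stored in R,G,B,A order as described above.

```csharp
using System;

static class PlanarReference
{
    // Splits interleaved 32-bit pixels (assumed byte order R,G,B,A) into
    // three consecutive planes: all R values, then all G, then all B.
    public static byte[] ToPlanarRgb(byte[] rgba)
    {
        if (rgba.Length % 4 != 0)
            throw new ArgumentException("Length must be a multiple of 4.", nameof(rgba));

        int pixels = rgba.Length / 4;
        byte[] planar = new byte[pixels * 3];

        for (int p = 0, src = 0; p < pixels; p++, src += 4)
        {
            planar[p] = rgba[src];                  // red plane
            planar[pixels + p] = rgba[src + 1];     // green plane
            planar[2 * pixels + p] = rgba[src + 2]; // blue plane
            // rgba[src + 3] (alpha) is dropped
        }
        return planar;
    }
}
```

For two pixels (1,2,3,255) and (4,5,6,255), this returns 1,4,2,5,3,6. Any faster variant can be checked against it on the same input.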


byte[] d1 = new byte[width * height * 4];
// -- fill d1 with data
.....
//
byte[] datain24format = new byte[width * height * 3];
int datalenght = d1.Length; // source length in bytes (4 bytes per pixel)
unsafe
{
   fixed (byte* pd1 = d1)
   {
      fixed (byte* pd24b = datain24format)
      {
         // Start at the last pixel and walk backwards; each int holds one RGBA pixel.
         int* actualpointeur = (int*)pd1 + (datalenght / 4) - 1;

         // Each pointer starts at the last byte of its plane (red, green, blue).
         byte* redindexp = (pd24b + (datain24format.Length / 3)) - 1;
         byte* greenindexp = (pd24b + (datain24format.Length / 3) * 2) - 1;
         byte* blueindexp = (pd24b + datain24format.Length) - 1;
         for (int i = datalenght - 1; i >= 0; i -= 4, actualpointeur--, redindexp--, greenindexp--, blueindexp--)
         {
            // On a little-endian CPU the low byte of the int is the first byte in memory (R).
            *redindexp = (byte)*(actualpointeur);
            *greenindexp = (byte)(*(actualpointeur) >> 8);
            *blueindexp = (byte)(*(actualpointeur) >> 16);
         }
      }
   }
}









Update

I tried loop unrolling (I didn't know this trick before), with 16 pixels per iteration (unrolling further doesn't improve performance).

My result is now 5 ms (~15,000 ticks) :)

Here is my new loop:

// Note: this reads and writes out of bounds unless the pixel count is a multiple of 16.
for (int i = _DataLenght - 1; i >= 0; i -= 64, actualpointeur -= 16, redindexp -= 16, greenindexp -= 16, blueindexp -= 16)
{
   *redindexp = (byte)*(actualpointeur);
   *greenindexp = (byte)(*(actualpointeur) >> 8);
   *blueindexp = (byte)(*(actualpointeur) >> 16);
   *(redindexp - 1) = (byte)*(actualpointeur - 1);
   *(greenindexp - 1) = (byte)(*(actualpointeur - 1) >> 8);
   *(blueindexp - 1) = (byte)(*(actualpointeur - 1) >> 16);
   ......
   *(redindexp - 15) = (byte)*(actualpointeur - 15);
   *(greenindexp - 15) = (byte)(*(actualpointeur - 15) >> 8);
   *(blueindexp - 15) = (byte)(*(actualpointeur - 15) >> 16);
}

Recommended answer

Looks fairly to the point to me. Traditionally, I would try to stay away from byte-sized pointers and use something closer to the native word size (32 or 64 bits) with bit manipulation, to minimize memory accesses. That said, I don't think it makes much difference on modern CPUs.
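One way to read the word-sized suggestion is to gather the same channel from four pixels into a single 32-bit value and store it with one write, instead of four separate byte writes per plane. The sketch below is an illustration under assumptions, not the answerer's code: `PlanarWordwise` is a hypothetical name, it uses `MemoryMarshal.Cast` (available from .NET Core 2.1) rather than raw pointers, it assumes a little-endian CPU, and it relies on the platform tolerating unaligned 32-bit stores when the pixel count is not a multiple of 4. Whether it is actually faster would need to be measured.

```csharp
using System;
using System.Runtime.InteropServices;

static class PlanarWordwise
{
    // Packs one channel from four pixels into a uint and stores it with a
    // single 32-bit write per plane. Assumes R,G,B,A byte order in memory.
    public static byte[] ToPlanarRgb(byte[] rgba)
    {
        if (!BitConverter.IsLittleEndian)
            throw new PlatformNotSupportedException("Sketch assumes little-endian.");
        if (rgba.Length % 4 != 0)
            throw new ArgumentException("Length must be a multiple of 4.", nameof(rgba));

        int pixels = rgba.Length / 4;
        int groups = pixels / 4; // number of complete 4-pixel groups
        byte[] planar = new byte[pixels * 3];

        ReadOnlySpan<uint> src = MemoryMarshal.Cast<byte, uint>(rgba.AsSpan());
        Span<uint> red = MemoryMarshal.Cast<byte, uint>(planar.AsSpan(0, pixels));
        Span<uint> green = MemoryMarshal.Cast<byte, uint>(planar.AsSpan(pixels, pixels));
        Span<uint> blue = MemoryMarshal.Cast<byte, uint>(planar.AsSpan(2 * pixels, pixels));

        for (int g = 0, s = 0; g < groups; g++, s += 4)
        {
            uint p0 = src[s], p1 = src[s + 1], p2 = src[s + 2], p3 = src[s + 3];
            red[g] = (p0 & 0xFF) | ((p1 & 0xFF) << 8) | ((p2 & 0xFF) << 16) | (p3 << 24);
            green[g] = ((p0 >> 8) & 0xFF) | (p1 & 0xFF00) | ((p2 & 0xFF00) << 8) | ((p3 & 0xFF00) << 16);
            blue[g] = ((p0 >> 16) & 0xFF) | ((p1 >> 8) & 0xFF00) | (p2 & 0xFF0000) | ((p3 & 0xFF0000) << 8);
        }

        // Leftover pixels (fewer than four) are handled one byte at a time.
        for (int p = groups * 4; p < pixels; p++)
        {
            uint v = src[p];
            planar[p] = (byte)v;
            planar[pixels + p] = (byte)(v >> 8);
            planar[2 * pixels + p] = (byte)(v >> 16);
        }
        return planar;
    }
}
```

The span-based form avoids the need for an unsafe context; the same packing could be written with the `fixed` pointers from the question.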

Make sure you are using a release build! This will make a big difference.

I've done similar things when working with audio. One thing that helps, though it won't make a big difference, is to do more work on each pass of the loop. Process ten pixels per loop iteration, then handle anything left over in a second loop. This increases the ratio of work done to branches taken.
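The main-loop-plus-leftover pattern described here could look roughly like the sketch below. It is an illustration only: `PlanarUnrolled` is a hypothetical name, the unroll factor of 4 is chosen just to keep the listing short (the answer suggests ten), and it uses array indexing instead of pointers so it runs without an unsafe context.

```csharp
using System;

static class PlanarUnrolled
{
    // Main loop handles four pixels per iteration; a second loop handles
    // whatever is left, so any pixel count is processed safely.
    public static void Split(byte[] rgba, byte[] planar)
    {
        int pixels = rgba.Length / 4;
        if (planar.Length < pixels * 3)
            throw new ArgumentException("Destination too small.", nameof(planar));

        int g0 = pixels;     // start of the green plane
        int b0 = 2 * pixels; // start of the blue plane
        int p = 0, s = 0;

        for (; p + 4 <= pixels; p += 4, s += 16)
        {
            planar[p] = rgba[s];
            planar[g0 + p] = rgba[s + 1];
            planar[b0 + p] = rgba[s + 2];

            planar[p + 1] = rgba[s + 4];
            planar[g0 + p + 1] = rgba[s + 5];
            planar[b0 + p + 1] = rgba[s + 6];

            planar[p + 2] = rgba[s + 8];
            planar[g0 + p + 2] = rgba[s + 9];
            planar[b0 + p + 2] = rgba[s + 10];

            planar[p + 3] = rgba[s + 12];
            planar[g0 + p + 3] = rgba[s + 13];
            planar[b0 + p + 3] = rgba[s + 14];
        }

        // Leftover loop: at most three pixels.
        for (; p < pixels; p++, s += 4)
        {
            planar[p] = rgba[s];
            planar[g0 + p] = rgba[s + 1];
            planar[b0 + p] = rgba[s + 2];
        }
    }
}
```

The leftover loop is what makes the unrolled version safe for frame sizes that are not a multiple of the unroll factor.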

Always use local variables rather than member ones. This makes a huge difference, but you are already doing this.

Make sure you have set the thread/process priority to something very high, but be careful, as you can make the machine unresponsive by doing this.

The only other thing I can think of is going multithreaded, if you are on a multi-core machine. It could help to split the work up and keep each thread's affinity on a separate processor. That said, because of what goes on with the various CPU caches, it might not make any difference.
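Splitting the work across cores might be sketched as follows, using `Parallel.For` from the Task Parallel Library. This is an assumption-laden illustration: `PlanarParallel` and its `chunks` parameter are hypothetical, the sketch does not pin threads to processors (the thread pool decides where each chunk runs), and for a frame this size the scheduling overhead may cancel out the gain, as the answer cautions.

```csharp
using System;
using System.Threading.Tasks;

static class PlanarParallel
{
    // Divides the pixels into contiguous chunks and converts each chunk on
    // a thread-pool thread. Chunks write to disjoint ranges of each plane,
    // so no locking is needed.
    public static void Split(byte[] rgba, byte[] planar, int chunks)
    {
        if (chunks < 1)
            throw new ArgumentOutOfRangeException(nameof(chunks));

        int pixels = rgba.Length / 4;
        if (planar.Length < pixels * 3)
            throw new ArgumentException("Destination too small.", nameof(planar));

        int perChunk = (pixels + chunks - 1) / chunks; // round up

        Parallel.For(0, chunks, c =>
        {
            int start = c * perChunk;
            int end = Math.Min(start + perChunk, pixels);
            for (int p = start, s = start * 4; p < end; p++, s += 4)
            {
                planar[p] = rgba[s];                  // red plane
                planar[pixels + p] = rgba[s + 1];     // green plane
                planar[2 * pixels + p] = rgba[s + 2]; // blue plane
            }
        });
    }
}
```

A call such as `PlanarParallel.Split(d1, datain24format, Environment.ProcessorCount)` would use one chunk per logical processor; timing it against the single-threaded loop is the only way to know whether it helps.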

Last thought: buy a supercomputer.

