Faster bit-level data packing


Question

A 256*64 pixel OLED display connected to a Raspberry Pi (Zero W) expects 4-bit greyscale pixel data packed into bytes (i.e. two pixels per byte), so 8192 bytes in total. E.g. the bytes

0a 0b 0c 0d (only lower nibble has data)

become

ab cd

Converting these bytes either obtained from a Pillow (PIL) Image or a cairo ImageSurface takes up to 0.9 s when naively iterating the pixel data, depending on color depth.
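In code, the byte example above amounts to the following (assuming, as in the example, that only the low nibble of each input byte carries data):

```python
# Pack pairs of low-nibble bytes into single bytes:
# 0a 0b 0c 0d  ->  ab cd
data = bytes([0x0A, 0x0B, 0x0C, 0x0D])
packed = bytes((data[i] << 4) | data[i + 1] for i in range(0, len(data), 2))
assert packed == b'\xab\xcd'
```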

Combining every two bytes from a Pillow "L" (monochrome 8 bit) Image:

imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]
packed = []
msn = None
for n in nibbles:
    nib = n & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        packed.append(b)
        msn = None
    else:
        msn = nib

This (omitting state and saving float/integer conversion) brings it down to about half (0.2 s):

packed = []
for b in range(0, 256*64, 2):
    packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )

Basically the first applied to an RGB24 (32 bit!) cairo ImageSurface, though with crude greyscale conversion:

mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)

# convert xRGB
o = []
msn = None
for p in range(0, len(mv), 4):
    nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16) & 0x0F
    if msn is not None:
        b = msn << 4 | nib
        o.append(b)
        msn = None
    else:
        msn = nib

takes about twice as long (0.9 s vs 0.4 s).

The struct module does not support nibbles (half-bytes).
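Its smallest integer format character is `'B'` (one full byte), so the two nibbles have to be combined manually before `struct` can pack them, at which point `struct` no longer adds anything:

```python
import struct

# struct can only pack whole bytes; the nibble merge must happen first.
pair = (0x0A << 4) | 0x0B
assert struct.pack('B', pair) == b'\xab'
```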

bitstring does allow packing nibbles:

>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>

But there does not seem to be a method to unpack this into a list of integers quickly -- this takes 30 seconds!:

a = bitstring.BitStream()
for p in imd:
    a.append( bitstring.Bits(uint=p//16, length=4) )

packed=[]
a.pos=0
for p in range(256*64//2):
    packed.append( a.read(8).uint )

Does Python 3 have the means to do this efficiently or do I need an alternative? External packer wrapped with ctypes? The same, but simpler, with Cython (I have not yet looked into these)? Looks very good, see my answer.

Solution

Down to 130 ms from 200 ms by just wrapping the loop in a function

def packer0(imd):
    """same loop in a def"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed

Down to 35 ms by Cythonizing the same code

def packer1(imd):
    """Cythonize python nibble packing loop"""
    packed = []
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed

Down to 16 ms with typing

def packer2(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int b
    for b in range(0, 256*64, 2):
        packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
    return packed

Not much of a difference with a "simplified" loop

def packer3(imd):
    """Cythonize python nibble packing loop, typed"""
    packed = []
    cdef unsigned int i
    for i in range(256*64/2):
        packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
    return packed

Maybe a tiny bit faster even (15 ms)

def packer4(it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
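For reference, the Cython variants above assume a standard `cythonize` build; a minimal `setup.py` sketch (the module name `pack` is inferred from the `timeit` invocations below):

```python
# setup.py -- minimal sketch for building the 'pack' extension module.
# Build in place with: python3 setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("pack.pyx"))
```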

Here are the timings with timeit:

>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi@raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop

This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).

Or with parallelism on the GPU or the multi-core Raspberry Pis.
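Staying in pure Python (no Cython, no NumPy), one partly vectorized alternative -- a sketch I have not benchmarked on the Pi -- pushes most of the work into C-level `bytes.translate` and slicing, leaving only the final merge at Python level:

```python
# Precomputed 256-entry translation tables: one keeps each byte's high
# nibble in place, the other moves the high nibble down to the low nibble.
HI = bytes(b & 0xF0 for b in range(256))
LO = bytes(b >> 4 for b in range(256))

def pack_translate(data: bytes) -> bytes:
    """Pack 8-bit pixels into 4-bit pairs via translate() + slicing."""
    high = data.translate(HI)[0::2]  # even pixels, high nibble kept in place
    low = data.translate(LO)[1::2]   # odd pixels, shifted to the low nibble
    return bytes(h | l for h, l in zip(high, low))
```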


A crude comparison with the same loop in C (ideone):

#include <stdio.h>
#include <stdint.h>
#define SIZE (256*64)
int main(void) {
  uint8_t in[SIZE] = {0};
  uint8_t out[SIZE/2] = {0};
  uint8_t t;
  for(t=0; t<100; t++){
    uint16_t i;
    for(i=0; i<SIZE/2; i++){
        out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
    }
  }
  return 0;
}

It's apparently 100 times faster:

pi@raspberry:~ $ gcc p.c
pi@raspberry:~ $ time ./a.out

real    0m0.085s
user    0m0.060s
sys     0m0.010s


Eliminating the shifts/divisions may be another slight optimization (I have not checked the resulting C, nor the binary):

def packs(bytes it):
    """Cythonize python nibble packing loop, typed"""
    cdef unsigned int n = len(it)//2
    cdef unsigned int i
    return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]

results in

python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop
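As a quick sanity check that the shift/mask formulation matches the division one (plain Python, helper names mine):

```python
def pack_div(it):
    # Division form: isolate the high nibble via //16, then recombine.
    return [(it[i*2] // 16) << 4 | it[i*2+1] // 16 for i in range(len(it) // 2)]

def pack_shift(it):
    # Shift/mask form: keep the high nibble with &0xF0, move the other with >>4.
    return [(it[i << 1] & 0xF0) | (it[(i << 1) + 1] >> 4) for i in range(len(it) // 2)]

data = bytes(range(256)) * 64  # 16384 pixels -> 8192 packed bytes
assert pack_div(data) == pack_shift(data)
```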
