Extract specific bytes from a binary file in Python

Problem description

I have very large binary files with x int16 data points for y sensors, along with a header containing some basic info. The file is written as y values for each sample time, up to x samples, then another set of readings, and so on. If I want all of the data, I use numpy.fromfile(), which works really nicely and fast. However, if I only want a subset of the samples or only specific sensors, I currently have a hideous double for loop using file.seek(), file.read(), and struct.unpack() that takes forever. Is there another way to do this faster in Python? Perhaps with mmap(), which I do not understand well? Or just reading the whole file with fromfile() and then subsampling?

import struct
import numpy

# One row per requested sensor, one column per sample time.
data = numpy.empty((len(sensor_indices), num_pts), dtype=numpy.int16)
for i in range(num_pts):
    for j in range(len(sensor_indices)):
        curr_file.seek(bin_offsets[j])           # bin_offsets[j]: byte offset for sensor j
        data_binary = curr_file.read(2)          # read one int16
        data[j][i] = struct.unpack('h', data_binary)[0]

Having followed advice from @rrauenza on mmap, which was great info, I edited the code to:

import mmap
import struct
import numpy

mm = mmap.mmap(curr_file.fileno(), 0, access=mmap.ACCESS_READ)
data = numpy.empty((len(sensor_indices), num_pts), dtype=numpy.int16)
offset = 0
for i in range(num_pts):
    for j in range(len(sensor_indices)):
        offset += bin_offsets[j] * 2             # step to the next wanted int16 (2 bytes each)
        data[j][i] = struct.unpack('h', mm[offset:offset+2])[0]

While this IS faster than before, it's still orders of magnitude slower than:

import numpy as np

shape = (x, y)                      # x sample times, y sensors
data = np.fromfile(file=self.curr_file, dtype=np.int16).reshape(shape)
data = data.transpose()             # now (sensors, samples)
data = data[sensor_indices, :]      # keep only the wanted sensors
data = data[:, range(num_pts)]      # keep only the first num_pts samples

I tested this with a smaller 30 MB file containing only 16 sensors and 30 seconds of data. The original code took 160 s, the mmap version 105 s, and np.fromfile with subsampling 0.33 s.

The remaining question is: clearly numpy.fromfile() is better for the small file, but will there be issues with much larger files, which may be up to 20 GB with hours or days of data and up to 500 sensors?

Answer

I would definitely try mmap():

https://docs.python.org/2/library/mmap.html

You're reading a lot of small pieces, which incurs a lot of system call overhead if you're calling seek() and read() for every int16 you extract.

I've written a small test to demonstrate:

#!/usr/bin/python

import mmap
import os
import struct
import sys

FILE = "/opt/tmp/random"  # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10


def byfile():
    # seek() + read() two bytes per value: one pair of system calls per int16
    sum = 0
    with open(FILE, "r") as fd:
        for offset in range(0, SIZE/BYTES, SKIP*BYTES):
            fd.seek(offset)
            data = fd.read(BYTES)
            sum += struct.unpack('h', data)[0]
    return sum


def bymmap():
    # slice the mmap'd file instead: no per-value system calls
    sum = 0
    with open(FILE, "r") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
        for offset in range(0, SIZE/BYTES, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            sum += struct.unpack('h', data)[0]
    return sum


if sys.argv[1] == 'mmap':
    print bymmap()

if sys.argv[1] == 'file':
    print byfile()

I ran each method twice to compensate for caching. I used time because I wanted to measure user and sys time.

Here are the results:

[centos7:/tmp]$ time ./test.py file
-211990391

real    0m44.656s
user    0m35.978s
sys     0m8.697s
[centos7:/tmp]$ time ./test.py file
-211990391

real    0m43.091s
user    0m37.571s
sys     0m5.539s
[centos7:/tmp]$ time ./test.py mmap
-211990391

real    0m16.712s
user    0m15.495s
sys     0m1.227s
[centos7:/tmp]$ time ./test.py mmap
-211990391

real    0m16.942s
user    0m15.846s
sys     0m1.104s
[centos7:/tmp]$ 

(The sum -211990391 just validates both versions do the same thing.)

Looking at each version's 2nd result, mmap() is ~1/3rd of the real time. User time is ~1/2 and system time is ~1/5th.

Your other options for perhaps speeding this up are:

(1) As you mentioned, load the whole file. The large I/Os instead of the small I/Os could speed things up. If you exceed system memory, though, you'll fall back to paging, which will be worse than mmap() (since you have to page out). I'm not super hopeful here because mmap is already using larger I/Os.
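
If the 20 GB case does exceed RAM, numpy.memmap is one middle ground worth trying: it maps the file much like mmap() but exposes it as an array, so indexing only faults in the pages it actually touches. A minimal sketch, assuming the (samples x sensors) int16 layout from the question; the file name and constants below are placeholders, not values from the original post:

import numpy as np

# Placeholder layout values; substitute whatever the file header actually says.
HEADER_BYTES = 1024        # size of the header preceding the int16 data
NUM_SENSORS = 500          # y: int16 values written per sample time
NUM_SAMPLES = 10000000     # x: number of sample times in the file

# Map the int16 payload without reading it all into RAM.
mm = np.memmap("data.bin", dtype=np.int16, mode="r",
               offset=HEADER_BYTES, shape=(NUM_SAMPLES, NUM_SENSORS))

# Fancy indexing copies out only the requested values; only the
# touched pages are ever faulted in from disk.
sensor_indices = [3, 17, 42]
subset = mm[::10, sensor_indices].T   # every 10th sample for the selected sensors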

(2) Concurrency. Maybe reading the file in parallel through multiple threads could speed things up, but you'll have the Python GIL to deal with. Multiprocessing will work better by avoiding the GIL, and you could easily pass your data back to a top level handler. This will, however, work against the next item, locality: You might make your I/O more random.

(3) Locality. Somehow organize your data (or order your reads) so that your data is closer together. mmap() pages the file in chunks according to the system pagesize:

>>> import mmap
>>> mmap.PAGESIZE
4096
>>> mmap.ALLOCATIONGRANULARITY
4096
>>> 

If your data is closer together (within the 4k chunk), it will already have been loaded into the buffer cache.
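
For instance, for the layout described in the question (header, then y int16 values per sample time), the (sample, sensor) byte offsets could be computed up front and read in sorted order, so values sharing a 4 KB page are touched back to back. A rough sketch, with the file name, header size, and counts below as placeholder assumptions:

import mmap
import struct
import numpy

BYTES = 2                              # one int16
num_sensors = 500                      # placeholder: y values per sample time
header_bytes = 1024                    # placeholder: size of the file header
num_pts = 1000                         # samples wanted (as in the question's code)
sensor_indices = [3, 17, 42]           # sensors wanted
record_size = num_sensors * BYTES      # one full frame of sensor readings

# Precompute every (sample i, sensor j) byte offset and sort the reads.
reads = sorted((header_bytes + i * record_size + j * BYTES, row, i)
               for i in range(num_pts)
               for row, j in enumerate(sensor_indices))

data = numpy.empty((len(sensor_indices), num_pts), dtype=numpy.int16)
with open("data.bin", "rb") as fd:
    mm = mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ)
    for off, row, i in reads:          # ascending offsets keep reads within a page together
        data[row, i] = struct.unpack('h', mm[off:off + BYTES])[0]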

(4) Better hardware. Like an SSD.

I did run this on an SSD and it was much faster. I ran a profile of the Python code, wondering if the unpack was expensive. It's not:

$ python -m cProfile test.py mmap                                                                                                                        
121679286
         26843553 function calls in 8.369 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    6.204    6.204    8.357    8.357 test.py:24(bymmap)
        1    0.012    0.012    8.369    8.369 test.py:3(<module>)
 26843546    1.700    0.000    1.700    0.000 {_struct.unpack}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {posix.stat}
        1    0.453    0.453    0.453    0.453 {range}

Addendum:

Curiosity got the best of me and I tried out multiprocessing. I need to look at my partitioning closer, but the number of unpacks (53687092) is the same across trials:

$ time ./test2.py 4
[(4415068.0, 13421773), (-145566705.0, 13421773), (14296671.0, 13421773), (109804332.0, 13421773)]
(-17050634.0, 53687092)

real    0m5.629s
user    0m17.756s
sys     0m0.066s
$ time ./test2.py 1
[(264140374.0, 53687092)]
(264140374.0, 53687092)

real    0m13.246s
user    0m13.175s
sys     0m0.060s

Code:

#!/usr/bin/python

import functools
import multiprocessing
import mmap
import os
import struct
import sys

FILE = "/tmp/random"  # dd if=/dev/random of=/tmp/random bs=1024k count=1024
SIZE = os.stat(FILE).st_size
BYTES = 2
SKIP = 10


def bymmap(poolsize, n):
    # Worker n sums its own 1/poolsize slice of the file.
    partition = SIZE/poolsize
    initial = n * partition
    end = initial + partition
    sum = 0.0
    unpacks = 0
    with open(FILE, "r") as fd:
        mm = mmap.mmap(fd.fileno(), 0, prot=mmap.PROT_READ)
        for offset in xrange(initial, end, SKIP*BYTES):
            data = mm[offset:offset+BYTES]
            sum += struct.unpack('h', data)[0]
            unpacks += 1
    return (sum, unpacks)


poolsize = int(sys.argv[1])
pool = multiprocessing.Pool(poolsize)
results = pool.map(functools.partial(bymmap, poolsize), range(0, poolsize))
print results
print reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), results)
