Fastest way to read in and slice binary data files in Python


Question


I have a processing script that is designed to pull in binary data files of type "uint16" and do various processing in chunks of 6400 values at a time. The code was originally written in Matlab, but because the analysis codes are written in Python we wanted to streamline the process by having everything done in Python. The problem is I've noticed that my Python code is significantly slower than Matlab's fread function.


Simply put, the Matlab code is:

fid = fopen(filename); 
frame = reshape(fread(fid,80*80,'uint16'),80,80);  


While my Python code is simply:

import numpy as np
from struct import unpack
with open(filename, 'rb') as f:
    frame = np.array(unpack("H"*6400, f.read(12800))).reshape(80, 80).astype('float64')


The file size varies heavily from 500 MB -> 400 GB, so I believe finding a faster way of parsing the data in Python could pay dividends on the larger files. A 500 MB file typically has ~50,000 chunks, and this number increases linearly with file size. The speed difference I am seeing is roughly:

Python = 4 x 10^-4 seconds / chunk

Matlab = 6.5 x 10^-5 seconds / chunk


The processing shows that, over time, Matlab is ~5x faster than the Python method I've implemented. I have explored methods such as numpy.fromfile and numpy.memmap, but because these methods require opening the entire file into memory at some point, it limits the use case as my binary files are quite large. Is there some pythonic method for doing this that I am missing? I would have thought Python would be exceptionally fast at opening and reading binary files. Any advice is greatly appreciated.
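
For reference, the per-chunk loop this implies looks roughly like the sketch below. It only restates the setup described above (6400 uint16 values per chunk, reshaped to 80x80 and cast to float64); the filename variable and the per-frame analysis are placeholders:

import numpy as np
from struct import unpack

with open(filename, 'rb') as f:          # filename: path to the binary data file
    while True:
        buf = f.read(12800)              # 6400 uint16 values = 12800 bytes per chunk
        if len(buf) < 12800:             # stop at end of file
            break
        frame = np.array(unpack("H" * 6400, buf)).reshape(80, 80).astype('float64')
        # ... per-frame processing goes here ...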

Answer

Write a chunk to a file:

In [117]: dat = np.random.randint(0,1028,80*80).astype(np.uint16)
In [118]: dat.tofile('test.dat')
In [119]: dat
Out[119]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)

Load it your way:

In [120]: import struct
In [121]: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...:     
In [122]: frame
Out[122]: array([266, 776, 458, ..., 519,  38, 840])

Using fromfile:

In [124]: np.fromfile('test.dat',count=6400,dtype=np.uint16)
Out[124]: array([266, 776, 458, ..., 519,  38, 840], dtype=uint16)

Compare times:

In [125]: %%timeit
     ...: with open('test.dat','rb') as f:
     ...:     frame = np.array(struct.unpack("H"*6400,f.read(12800)))
     ...: 
1000 loops, best of 3: 898 µs per loop

In [126]: timeit np.fromfile('test.dat',count=6400,dtype=np.uint16)
The slowest run took 5.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 36.6 µs per loop

fromfile is much faster.


Time for the struct.unpack, without np.array, is 266 µs; for just the f.read, 23 µs. So it's the unpack, plus the more general and robust np.array, that takes so much longer. The file read itself is not a problem. (np.array can handle many kinds of input, such as lists of lists and lists of objects, so it has to spend more time parsing and evaluating the inputs.)
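
To reproduce that decomposition, something along the following lines works; this is a rough sketch (absolute numbers vary by machine) that times the raw read, the bare unpack, and unpack wrapped in np.array against the 12800-byte sample file written above:

import struct
import timeit

setup = "import struct, numpy as np; buf = open('test.dat', 'rb').read(12800)"

# each call prints the total seconds for 1000 runs; divide by 1000 for the per-loop time
print(timeit.timeit("open('test.dat', 'rb').read(12800)", number=1000))                     # read only
print(timeit.timeit("struct.unpack('H' * 6400, buf)", setup=setup, number=1000))            # unpack only
print(timeit.timeit("np.array(struct.unpack('H' * 6400, buf))", setup=setup, number=1000))  # unpack + np.array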


A slightly faster variant on fromfile is your read plus frombuffer:

In [133]: with open('test.dat','rb') as f:
     ...:      frame3 = np.frombuffer(f.read(12800),dtype=np.uint16)
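
Put together for the original problem, a chunked loop along the following lines reads one frame at a time, so nothing close to a full 400 GB file ever has to sit in memory. This is a sketch under the same assumptions as the question (80x80 uint16 frames, a filename variable, float64 frames for the downstream analysis):

import numpy as np

with open(filename, 'rb') as f:
    while True:
        buf = f.read(12800)              # one 80*80 uint16 frame
        if len(buf) < 12800:             # stop at end of file
            break
        frame = np.frombuffer(buf, dtype=np.uint16).reshape(80, 80).astype('float64')
        # ... process frame ...

np.fromfile(f, dtype=np.uint16, count=6400) on the same open file handle is an equivalent alternative: with an explicit count it reads one chunk per call and advances the file position, so it likewise never loads the whole file.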
