Improve speed when reading a binary file


Question

I have a large binary file that I want to read into an array. The format of the binary file is:

  • there are 4 bytes of extra data at the start and at the end of each row, which I do not use;
  • in between, the row holds 8-byte values.

I am doing it like this:

        import numpy as np

        # nlines - number of rows in the binary file
        # ncols - number of values to read from a row

        fidbin = open('toto.mda', 'rb')  # open the file
        fidbin.read(4)                   # skip the first 4 bytes
        nvalues = nlines * ncols         # total number of values

        # np.float was removed from recent NumPy releases; use np.float64
        array = np.zeros(nvalues, dtype=np.float64)

        # read ncols values per row and skip the unused data at the end
        for c in range(int(nlines)):
            # read all the values from one row
            matrix = np.fromfile(fidbin, np.float64, count=int(ncols))
            Indice_start = c * ncols
            array[Indice_start:Indice_start + ncols] = matrix
            # skip 8 bytes: 4 at the end of this row + 4 at the start of the next
            fidbin.seek(fidbin.tell() + 8)
        fidbin.close()

It works well, but it is very slow for large binary files. Is there a way to increase the reading speed?

Recommended answer

You can use a structured data type and read the file with a single call to numpy.fromfile. For example, my file qaz.mda has five columns of floating-point values between the four-byte markers at the start and end of each row. Here's how you can create a structured data type and read the data.

First, create a data type that matches the format of each row:

In [547]: ncols = 5

In [548]: dt = np.dtype([('pre', np.int32), ('data', np.float64, ncols), ('post', np.int32)])

Read the file into a structured array:

In [549]: a = np.fromfile("qaz.mda", dtype=dt)

In [550]: a
Out[550]: 
array([(1, [0.0, 1.0, 2.0, 3.0, 4.0], 0),
       (2, [5.0, 6.0, 7.0, 8.0, 9.0], 0),
       (3, [10.0, 11.0, 12.0, 13.0, 14.0], 0),
       (4, [15.0, 16.0, 17.0, 18.0, 19.0], 0),
       (5, [20.0, 21.0, 22.0, 23.0, 24.0], 0)], 
      dtype=[('pre', '<i4'), ('data', '<f8', (5,)), ('post', '<i4')])

Pull out just the data that you want:

In [551]: data = a['data']

In [552]: data
Out[552]: 
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])
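The steps above can be put together in a self-contained sketch. The sample file name, the row count, and the values written to the marker fields are assumptions made for illustration; only the record layout (4-byte marker, ncols 8-byte floats, 4-byte marker) comes from the question.

```python
import os
import tempfile

import numpy as np

ncols = 5
nlines = 4  # hypothetical row count for the sample file

# Record layout: 4-byte marker, ncols float64 values, 4-byte marker.
dt = np.dtype([('pre', np.int32), ('data', np.float64, (ncols,)), ('post', np.int32)])

# Build a small test file with that layout (made-up sample data).
records = np.zeros(nlines, dtype=dt)
records['pre'] = ncols * 8   # assumed marker value, not taken from the question
records['post'] = ncols * 8
records['data'] = np.arange(nlines * ncols, dtype=np.float64).reshape(nlines, ncols)

path = os.path.join(tempfile.mkdtemp(), 'sample.mda')
records.tofile(path)

# Read every row with a single call, then keep only the values.
a = np.fromfile(path, dtype=dt)
data = a['data']

print(dt.itemsize)  # bytes per row: 4 + 8 * ncols + 4
print(data.shape)   # (nlines, ncols)
```

Here data is a view into the structured array, so its rows are not contiguous in memory. If later code needs a compact array, np.ascontiguousarray(data) makes one.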


You could also experiment with numpy.memmap to see if it improves performance:

In [563]: a = np.memmap("qaz.mda", dtype=dt)

In [564]: a
Out[564]: 
memmap([(1, [0.0, 1.0, 2.0, 3.0, 4.0], 0),
       (2, [5.0, 6.0, 7.0, 8.0, 9.0], 0),
       (3, [10.0, 11.0, 12.0, 13.0, 14.0], 0),
       (4, [15.0, 16.0, 17.0, 18.0, 19.0], 0),
       (5, [20.0, 21.0, 22.0, 23.0, 24.0], 0)], 
      dtype=[('pre', '<i4'), ('data', '<f8', (5,)), ('post', '<i4')])

In [565]: data = a['data']

In [566]: data
Out[566]: 
memmap([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])

Note that data above is still a memory-mapped array. To ensure that the data is copied to an array in memory, numpy.copy can be used:

In [567]: data = np.copy(a['data'])

In [568]: data
Out[568]: 
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])

Whether or not that is necessary depends on how you will use the array in the rest of your code.
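As a rough illustration of the difference, the sketch below maps a small sample file and compares the mapped field with a copy of it. The file name and its contents are made up, and mode='r' (read-only) is a choice made here rather than something the answer specifies.

```python
import os
import tempfile

import numpy as np

ncols = 5
dt = np.dtype([('pre', np.int32), ('data', np.float64, (ncols,)), ('post', np.int32)])

# Hypothetical sample file with three rows.
path = os.path.join(tempfile.mkdtemp(), 'sample.mda')
rows = np.zeros(3, dtype=dt)
rows['data'] = np.arange(3 * ncols, dtype=np.float64).reshape(3, ncols)
rows.tofile(path)

# Map the file read-only; pages are loaded lazily as they are accessed.
mapped = np.memmap(path, dtype=dt, mode='r')
view = mapped['data']                 # still backed by the file
in_memory = np.copy(mapped['data'])   # independent array held in RAM

print(type(view).__name__)
print(type(in_memory).__name__)
```

The copy costs one full pass over the data, but afterwards the array no longer depends on the file staying open or unchanged.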
