Read binary flatfile and skip bytes

Question

I have a binary file whose data is organized into 400-byte groups. From each group I want to read the bytes at position 304 to position 308 as an np.uint32 and collect the results in an array. However, I cannot find a NumPy method that lets me select which bytes to read; numpy.fromfile only offers an initial offset.

For example, if my file contains 1000 groups of 400 bytes, I need an array of size 1000 such that:

arr[0] = bytes 304-308
arr[1] = bytes 704-708
...
arr[-1] = bytes 399904 - 399908

Is there a NumPy method that would allow me to specify which bytes to read from a buffer?

Recommended answer

Another way to phrase what you are looking for is that you want to read uint32 numbers starting at offset 304, with a stride of 400 bytes. np.fromfile does not provide an argument for custom strides (although it probably should). You have a few options going forward.

The simplest is probably to load the entire file as uint32 and slice out the column you want. Each 400-byte group is 100 words long, and the value at byte 304 is word 76 of its group:

data = np.fromfile(filename, dtype=np.uint32)[304 // 4::400 // 4].copy()
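To check the slicing arithmetic, here is a minimal sketch that writes a small synthetic file and reads it back with the same expression. The file name, the record count of 5, and the constant names are assumptions made for the demo, not part of the original question.

```python
import os
import tempfile

import numpy as np

RECORD_SIZE = 400   # bytes per group, from the question
FIELD_OFFSET = 304  # byte offset of the uint32 inside each group
N_RECORDS = 5       # hypothetical small count for the demo

# Synthetic data: every record is zero except the uint32 at byte 304,
# which holds the record index.
records = np.zeros((N_RECORDS, RECORD_SIZE // 4), dtype=np.uint32)
records[:, FIELD_OFFSET // 4] = np.arange(N_RECORDS, dtype=np.uint32)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "demo.bin")
    records.tofile(filename)

    # Start at word 76 and step by 100 words, i.e. one value per group.
    data = np.fromfile(filename, dtype=np.uint32)[
        FIELD_OFFSET // 4::RECORD_SIZE // 4
    ].copy()

print(data)  # [0 1 2 3 4]
```

The trailing copy() detaches the result from the full-file array, so the large buffer can be freed once it goes out of scope.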

If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use a structured array instead:

dt = np.dtype([('_1', 'u1', 304), ('data', 'u4'), ('_2', 'u1', 92)])
data = np.fromfile(filename, dtype=dt)['data'].copy()

Here, _1 and _2 are padding fields used to discard the unneeded bytes with 1-byte resolution rather than 4. The three field sizes (304 + 4 + 92) add up to the 400-byte group size.
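A sketch of the case where this matters: a record size and field offset that are not multiples of 4. The 10-byte record, the offset of 3, the explicit little-endian '<u4', and the sample values are all assumptions chosen for illustration.

```python
import os
import struct
import tempfile

import numpy as np

RECORD_SIZE = 10   # hypothetical group size, deliberately not a multiple of 4
FIELD_OFFSET = 3   # hypothetical field position, also not a multiple of 4
PAD_AFTER = RECORD_SIZE - FIELD_OFFSET - 4

# Packed (unaligned) structured dtype: padding, the value, more padding.
dt = np.dtype([
    ("_1", "u1", FIELD_OFFSET),
    ("data", "<u4"),
    ("_2", "u1", PAD_AFTER),
])
assert dt.itemsize == RECORD_SIZE

# Build the raw bytes by hand so the layout is unambiguous.
values = [0, 1000, 2000, 3000]
raw = b"".join(
    b"\xff" * FIELD_OFFSET + struct.pack("<I", v) + b"\xff" * PAD_AFTER
    for v in values
)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "odd.bin")
    with open(filename, "wb") as f:
        f.write(raw)
    data = np.fromfile(filename, dtype=dt)["data"].copy()

print(data.tolist())  # [0, 1000, 2000, 3000]
```

Checking dt.itemsize against the record size is a cheap guard against a miscounted padding field.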

Loading the entire file is generally going to be much faster than seeking between reads, so these approaches are likely desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.
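One possible shape for a home-grown solution is to read the file in bounded chunks and keep only the wanted field from each chunk. This is a sketch, not the answer's own code: the function name read_field_chunked and the chunk_records parameter are hypothetical, and it assumes the field sits strictly inside the record so that both padding fields are non-empty.

```python
import os
import tempfile

import numpy as np


def read_field_chunked(filename, record_size=400, field_offset=304,
                       chunk_records=1024):
    """Collect one uint32 per record, holding only one chunk in memory."""
    # Assumes 0 < field_offset and field_offset + 4 < record_size.
    dt = np.dtype([
        ("_1", "u1", field_offset),
        ("data", "u4"),
        ("_2", "u1", record_size - field_offset - 4),
    ])
    parts = []
    with open(filename, "rb") as f:
        while True:
            buf = f.read(chunk_records * record_size)
            n_complete = len(buf) // record_size  # drop a trailing partial record
            if n_complete == 0:
                break
            chunk = np.frombuffer(buf, dtype=dt, count=n_complete)
            parts.append(chunk["data"].copy())
    return np.concatenate(parts) if parts else np.empty(0, dtype=np.uint32)


# Demo on a synthetic 10-record file, read 3 records at a time.
records = np.zeros((10, 100), dtype=np.uint32)
records[:, 304 // 4] = np.arange(10, dtype=np.uint32) * 7

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "demo.bin")
    records.tofile(filename)
    result = read_field_chunked(filename, chunk_records=3)

print(result.tolist())
```

Peak memory is roughly chunk_records * record_size bytes plus the output array, so the chunk size can be tuned to the machine.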

Memory maps can be created via Python's mmap module and wrapped in an ndarray using the buffer parameter, or you can use the np.memmap class, which does it for you:

mm = np.memmap(filename, dtype=np.uint32, mode='r', offset=0, shape=(1000, 400 // 4))
data = np.array(mm[:, 304 // 4])
del mm
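The shape above hardcodes 1000 rows. A small variation, sketched here under the assumption that the file consists only of complete 400-byte groups, derives the row count from the file size. The synthetic file and the values written into it are made up for the demo.

```python
import os
import tempfile

import numpy as np

RECORD_SIZE = 400
FIELD_OFFSET = 304
WORDS_PER_RECORD = RECORD_SIZE // 4

# Synthetic file with 6 records; the wanted field holds 10, 11, ..., 15.
records = np.zeros((6, WORDS_PER_RECORD), dtype=np.uint32)
records[:, FIELD_OFFSET // 4] = np.arange(6, dtype=np.uint32) + 10

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "demo.bin")
    records.tofile(filename)

    # Derive the number of groups from the file size instead of hardcoding it.
    n_records = os.path.getsize(filename) // RECORD_SIZE
    mm = np.memmap(filename, dtype=np.uint32, mode="r",
                   shape=(n_records, WORDS_PER_RECORD))
    data = np.array(mm[:, FIELD_OFFSET // 4])  # copy the column into memory
    del mm  # release the mapping before the file is removed

print(data.tolist())  # [10, 11, 12, 13, 14, 15]
```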

Using a raw mmap is arguably more efficient because you can specify strides and an offset that look directly into the map, skipping all the extra data. It is also more flexible, because the offset and strides do not have to be multiples of the size of an np.uint32:

import mmap
import numpy as np

with open(filename, 'rb') as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    data = np.ndarray(buffer=mm, dtype=np.uint32, offset=304, strides=(400,), shape=(1000,)).copy()

The final call to copy is required because the underlying buffer will be invalidated as soon as the memory map is closed, possibly leading to a segfault.
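To illustrate the point about offsets and strides that are not multiples of 4, here is a sketch of the raw mmap approach on a file with 10-byte records and the value at byte 3. The record layout, the little-endian '<u4' dtype, and the sample values are assumptions for the demo.

```python
import mmap
import os
import struct
import tempfile

import numpy as np

RECORD_SIZE = 10   # hypothetical group size, not a multiple of 4
FIELD_OFFSET = 3   # hypothetical field position, not a multiple of 4

values = [5, 6, 7]
raw = b"".join(
    b"\x00" * FIELD_OFFSET
    + struct.pack("<I", v)
    + b"\x00" * (RECORD_SIZE - FIELD_OFFSET - 4)
    for v in values
)

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "odd.bin")
    with open(filename, "wb") as f:
        f.write(raw)

    n_records = os.path.getsize(filename) // RECORD_SIZE
    with open(filename, "rb") as f, \
            mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        # The strided view touches only the 4 wanted bytes of each record;
        # copy() materializes it before the map is closed.
        data = np.ndarray(
            shape=(n_records,),
            dtype="<u4",
            buffer=mm,
            offset=FIELD_OFFSET,
            strides=(RECORD_SIZE,),
        ).copy()

print(data.tolist())  # [5, 6, 7]
```

The view is created and copied in a single expression, so no array referencing the map is left alive when the with block closes it.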
