Reading a binary file with memoryview


Question

In the code below I read a large file with a special structure: among other things, it contains two blocks that need to be processed at the same time. Instead of seeking back and forth in the file, I load these two blocks wrapped in memoryview calls:

with open(abs_path, 'rb') as bsa_file:
    # ...
    # load the file record block to parse later
    file_records_block = memoryview(bsa_file.read(file_records_block_size))
    # load the file names block
    file_names_block = memoryview(bsa_file.read(total_file_name_length))
    # close the file
file_records_index = names_record_index = 0
for folder_record in folder_records:
    name_size = struct.unpack_from('B', file_records_block, file_records_index)[0]
    # discard null terminator below
    folder_path = struct.unpack_from('%ds' % (name_size - 1),
        file_records_block, file_records_index + 1)[0]
    file_records_index += name_size + 1
    for __ in xrange(folder_record.files_count):
        file_name_len = 0
        for b in file_names_block[names_record_index:]:
            if b != '\x00': file_name_len += 1
            else: break
        file_name = unicode(struct.unpack_from('%ds' % file_name_len,
            file_names_block,names_record_index)[0])
        names_record_index += file_name_len + 1

The file is parsed correctly, but as this is my first use of the memoryview interface I am not sure I am doing it right. As the code shows, file_names_block consists of null-terminated C strings.

  1. Is my trick file_names_block[names_record_index:] using the memoryview magic, or do I create some n^2 slices? Would I need to use islice here?
  2. As shown, I just look for the null byte manually and then proceed to unpack_from. But I read in "How to split a byte string into separate bytes in python" that I can use cast() (docs?) on the memory view. Is there any way to use that (or another trick) to split the view into bytes? Could I just call split('\x00')? Would this preserve the memory efficiency?

I would appreciate insight into the one right way to do this (in Python 2).

Answer

A memoryview is not going to give you any advantage when it comes to null-terminated strings, as it has no facilities for anything but fixed-width data. You may as well use bytes.split() here instead:

file_names_block = bsa_file.read(total_file_name_length)
file_names = file_names_block.split(b'\00')
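
As a minimal sketch of the split approach (the sample bytes and the `split_names` helper are invented for illustration and are not part of the original code): a block that ends with a terminator yields one trailing empty string after splitting, which the helper drops.

```python
def split_names(file_names_block):
    """Split a block of null-terminated names into a list of bytes objects."""
    names = file_names_block.split(b'\x00')
    # A block that ends with a terminator produces a trailing empty element.
    if names and names[-1] == b'':
        names.pop()
    return names


# Hypothetical data standing in for bsa_file.read(total_file_name_length)
sample_block = b'mesh.nif\x00texture.dds\x00sound.wav\x00'
file_names = split_names(sample_block)
print(file_names)
```

The split does copy each name into its own bytes object, but the original loop made the same copies through unpack_from, so nothing is lost in memory terms and the per-byte Python loop disappears.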

Slicing a memoryview doesn't use extra memory (other than the view parameters), but if you use a cast (cast() is only available from Python 3.3 onwards), new native objects are produced for the parsed memory region the moment you access elements in the sequence.
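
To illustrate that a slice is only a view, here is a small sketch written for Python 3. It uses a bytearray purely so that the underlying buffer can be modified for the demonstration; the data is invented.

```python
data = bytearray(b'abc\x00defg\x00')
view = memoryview(data)

tail = view[4:]          # a new view object; no bytes are copied
data[4] = ord('X')       # change the underlying buffer in place

# The slice reflects the change because it shares memory with `data`.
shared = tail.tobytes()  # tobytes() is the point where a copy is made
print(shared)
```

Applied to the question's loop, each file_names_block[names_record_index:] creates one small view object per name rather than a copy of the remaining bytes, so there is no quadratic copying and islice is not needed for that reason.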

You can still use the memoryview for the file_records_block parsing; those strings are prefixed by a length, giving you the opportunity to use slicing. Just keep slicing bytes off the memory view as you process folder_path values; there's no need to keep an index:

for folder_record in folder_records:
    # The first byte is the length. Indexing returns an int on Python 3; on
    # Python 2 it returns a one-character str, so use ord(file_records_block[0]).
    name_size = file_records_block[0]
    folder_path = file_records_block[1:name_size].tobytes()  # excludes the null
    file_records_block = file_records_block[name_size + 1:]  # skip the null

Because the memoryview was sourced from a bytes object, indexing gives you the integer value of a byte on Python 3 (on Python 2 it gives a one-character string, hence ord()). Calling .tobytes() on a slice gives you a new bytes string for that section, and you can then keep slicing to leave the remainder for the next iteration.
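
A self-contained sketch of this slicing loop, written for Python 3. The record layout is an assumption taken from the question's code (a length byte that counts the null terminator, then the path, then the terminator), and `parse_folder_paths`, `make_record`, and the sample paths are hypothetical names introduced only for the example.

```python
import struct


def parse_folder_paths(file_records_block, folder_count):
    """Return folder paths from length-prefixed, null-terminated records."""
    view = memoryview(file_records_block)
    paths = []
    for _ in range(folder_count):
        name_size = view[0]                        # int on Python 3
        paths.append(view[1:name_size].tobytes())  # path without the terminator
        view = view[name_size + 1:]                # move past the terminator
    return paths


def make_record(path):
    """Build one hypothetical record: length byte, path, null terminator."""
    return struct.pack('B', len(path) + 1) + path + b'\x00'


sample = make_record(b'meshes\\armor') + make_record(b'textures')
folder_paths = parse_folder_paths(sample, 2)
print(folder_paths)
```

Rebinding `view` to the remaining slice replaces the running index of the original code. Only the extracted paths are copied, through tobytes().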
