reading a big file in binary with custom line terminator and writing in smaller chunks in python


Question


I have a file that uses \x01 as the line terminator. That is, the line terminator is not a newline but the byte value 001 (its ASCII representation is ^A).
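For concreteness, a tiny hypothetical sample of such a file's contents (the record text here is made up) looks like this:

```python
# hypothetical contents: three records separated by the \x01 byte
sample = b"first record\x01second record\x01third record\x01"

# bytes.split shows the individual records; note the trailing empty
# element produced by the final terminator
print(sample.split(b"\x01"))
# [b'first record', b'second record', b'third record', b'']
```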


I want to split the file into pieces of 10 MB each. Here is what I came up with:

size = 10 * 1000 * 1000  # 10 MB
i = 0
with open("in-file", "rb") as ifile:
    ofile = open("output0.txt", "wb")
    data = ifile.read(size)
    while data:
        ofile.write(data)
        ofile.close()
        data = ifile.read(size)
        i += 1
        ofile = open("output%d.txt" % (i), "wb")
ofile.close()


However, this results in files that are broken at arbitrary places. I want each file to end only at a byte value of 001, with the next read resuming from the following byte.

Answer


If it's just a one-byte terminator, you can do something like this:

def read_line(f_object, terminal_byte):
    # it's one line, you could just as easily do this inline; `or terminal_byte`
    # turns the empty read at EOF into the sentinel, so iter() also stops there
    return b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))


Then make a helper function that reads all the lines in a file:

def read_lines(f_object,terminal_byte):
    tmp = read_line(f_object,terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object,terminal_byte)
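As a quick self-contained sanity check of these two helpers (redefined here so the snippet runs on its own, written for bytes since the file is opened in "rb", with an `or terminal_byte` guard so the loop also stops at EOF, and using io.BytesIO to stand in for a real file):

```python
import io

def read_line(f_object, terminal_byte):
    # `or terminal_byte` turns the empty read at EOF into the sentinel,
    # so iter() stops even if the file doesn't end with a terminator
    return b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

# io.BytesIO stands in for a real file opened in "rb"
f = io.BytesIO(b"alpha\x01beta\x01gamma\x01")
print(list(read_lines(f, b"\x01")))  # [b'alpha', b'beta', b'gamma']
```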


Then make a function that chunks it up:

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        line += terminal_byte  # put back the terminator that read_line consumed
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)

Then do something like this:

with open("my_binary.dat", "rb") as f_in:
    for i, chunk in enumerate(make_chunks(f_in, b"\x01", 1024 * 1000 * 10)):
        with open("out%d.dat" % i, "wb") as f_out:
            f_out.write(chunk)


There might be some way to do this with libraries (or even an awesome builtin way), but I'm not aware of any offhand.
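Calling read(1) once per byte is slow for multi-gigabyte files. As an alternative sketch (my own, not part of the original answer): read large blocks, split on the terminator with bytes.split, and carry any partial record over to the next block, so chunks still only ever end on a record boundary:

```python
import io

def chunks_by_terminator(f_object, terminal_byte, max_size, block_size=1 << 20):
    # read big blocks and split on the terminator; `leftover` holds the
    # partial record at the end of a block until the rest arrives
    current, current_size, leftover = [], 0, b""
    while True:
        block = f_object.read(block_size)
        if not block:
            break
        records = (leftover + block).split(terminal_byte)
        leftover = records.pop()  # last piece may be an unfinished record
        for rec in records:
            line = rec + terminal_byte  # keep the terminator with its record
            current.append(line)
            current_size += len(line)
            if current_size > max_size:
                yield b"".join(current)
                current, current_size = [], 0
    if leftover:
        current.append(leftover)  # trailing data with no final terminator
    if current:
        yield b"".join(current)

# usage sketch with an in-memory file and a tiny max_size
f = io.BytesIO(b"aaa\x01bbbb\x01cc\x01")
print(list(chunks_by_terminator(f, b"\x01", max_size=5)))
# [b'aaa\x01bbbb\x01', b'cc\x01']
```

Concatenating the chunks reproduces the original file byte for byte, and each chunk ends on a terminator (except possibly the last, if the input lacks a trailing one).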

