Reading a big file in binary with a custom line terminator and writing it in smaller chunks in Python
Problem description
I have a file that uses \x01 as its line terminator. That is, the line terminator is NOT a newline but the byte with value 001 (0x01), which is displayed as ^A in caret notation.
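For reference, here is a minimal sketch of how that terminator byte looks in Python 3. The sample records are made up for illustration; only the \x01 terminator comes from the question.

```python
# The same single byte, spelled three equivalent ways.
TERMINATOR = b"\x01"
assert TERMINATOR == b"\001" == bytes([1])

# Made-up sample data: three records, each ended by the terminator.
sample = b"alpha\x01beta\x01gamma\x01"

# split() drops the terminator and leaves an empty piece after the last one.
records = sample.split(TERMINATOR)
print(records)  # [b'alpha', b'beta', b'gamma', b'']
```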
I want to split the file into pieces of 10 MB each. Here is what I came up with:
size = 10 * 1024 * 1024  # 10 MB
i = 0
with open("in-file", "rb") as ifile:
    ofile = open("output0.txt", "wb")
    data = ifile.read(size)
    while data:
        ofile.write(data)
        ofile.close()
        data = ifile.read(size)
        i += 1
        ofile = open("output%d.txt" % i, "wb")
    ofile.close()
However, this produces files that are broken at arbitrary places.
I want each file to end only at a byte with value 001, and the next read to resume from the following byte.
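To make the problem concrete, here is a small in-memory sketch of what fixed-size reads do to terminated records. The data and the tiny size of 8 are made up so the effect is visible; the slicing stands in for repeated read(size) calls.

```python
# Made-up records, each ended by the \x01 terminator.
data = b"alpha\x01beta\x01gamma\x01"
size = 8  # stand-in for the 10 MB read size

# Equivalent to calling read(size) repeatedly on a file holding `data`.
pieces = [data[i:i + size] for i in range(0, len(data), size)]
print(pieces)  # [b'alpha\x01be', b'ta\x01gamma', b'\x01']
```

The record "beta" is split across the first two pieces, and the terminator of "gamma" ends up alone in the third.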
Recommended answer
If the terminator is just one byte, you can do something like this (written for Python 3.8+, where the file yields bytes and the terminator is passed as a bytes object such as b"\x01"):

def read_line(f_object, terminal_byte):
    # Read one byte at a time; the terminator is kept at the end of the line.
    line = b""
    while not line.endswith(terminal_byte) and (byte := f_object.read(1)):
        line += byte  # read() returns b"" at EOF, which ends the loop
    return line

Then make a helper function that reads all the lines in a file:

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:  # an empty result only happens at EOF
        yield tmp
        tmp = read_line(f_object, terminal_byte)

Then make a function that chunks it up:

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            # The check runs after appending, so a chunk can exceed max_size by up to one line.
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)

Then do something like:

with open("my_binary.dat", "rb") as f_in:
    for i, chunk in enumerate(make_chunks(f_in, b"\x01", 1024*1000*10)):
        with open("out%d.dat" % i, "wb") as f_out:
            f_out.write(chunk)
There might be a way to do this with a library (or even a built-in), but I'm not aware of one offhand.
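Reading one byte at a time gets slow on large inputs. As an alternative sketch (not the answerer's method), the file can be read in large blocks and each piece cut at the last terminator that still fits, using bytes.rfind. The names find_cut and split_on_terminator, the out_pattern parameter, and the defaults are all assumptions made for this example. It assumes max_size is positive, keeps a record longer than max_size whole (so that one piece is oversized), and writes any unterminated tail of the input as the final piece.

```python
def find_cut(buffer, terminator, max_size, eof):
    """Return how many leading bytes of buffer to write, or 0 to wait for more data."""
    if eof and len(buffer) <= max_size:
        return len(buffer)  # last piece; lacks a terminator only if the input did
    pos = buffer.rfind(terminator, 0, max_size)  # last terminator that fits
    if pos == -1:
        pos = buffer.find(terminator)  # oversized record: cut at its end instead
    if pos == -1:
        return len(buffer) if eof else 0
    return pos + len(terminator)


def split_on_terminator(in_path, out_pattern, terminator=b"\x01",
                        max_size=10 * 1024 * 1024):
    """Write pieces of in_path to out_pattern % index; return the paths written."""
    paths, buffer, eof, need_more = [], b"", False, False
    with open(in_path, "rb") as f_in:
        while not eof or buffer:
            # Refill only when the buffer is short or holds no usable cut,
            # so memory stays near 2 * max_size unless a record is oversized.
            if not eof and (need_more or len(buffer) < max_size):
                block = f_in.read(max_size)
                eof = not block
                buffer += block
                need_more = False
            cut = find_cut(buffer, terminator, max_size, eof)
            if cut == 0:
                need_more = True
                continue
            path = out_pattern % len(paths)
            with open(path, "wb") as f_out:
                f_out.write(buffer[:cut])
            paths.append(path)
            buffer = buffer[cut:]
    return paths
```

With the file names used above, a call would look like split_on_terminator("my_binary.dat", "out%d.dat"). Unlike make_chunks, this sketch cuts before the size limit rather than after it, so pieces stay at or under max_size whenever a terminator falls within that span.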