Memory error when splitting a big file into smaller files in Python
Problem description
I have read several posts, including this one, but none helped.
Here is the Python code I currently use to split the file.
My input file is 15 GB and I am splitting it into 128 MB pieces. My computer has 8 GB of memory.
    import sys

    def read_line(f_object,terminal_byte):
        line = ''.join(iter(lambda:f_object.read(1),terminal_byte))
        line+="\x01"
        return line

    def read_lines(f_object,terminal_byte):
        tmp = read_line(f_object,terminal_byte)
        while tmp:
            yield tmp
            tmp = read_line(f_object,terminal_byte)

    def make_chunks(f_object,terminal_byte,max_size):
        current_chunk = []
        current_chunk_size = 0
        for line in read_lines(f_object,terminal_byte):
            current_chunk.append(line)
            current_chunk_size += len(line)
            if current_chunk_size > max_size:
                yield "".join(current_chunk)
                current_chunk = []
                current_chunk_size = 0
        if current_chunk:
            yield ''.join(current_chunk)

    inputfile=sys.argv[1]

    with open(inputfile,"rb") as f_in:
        for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)):
            with open("out%d.txt"%i,"wb") as f_out:
                f_out.write(chunk)
When I execute the script, I get the following error:
    Traceback (most recent call last):
      File "splitter.py", line 30, in <module>
        for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)):
      File "splitter.py", line 17, in make_chunks
        for line in read_lines(f_object,terminal_byte):
      File "splitter.py", line 12, in read_lines
        tmp = read_line(f_object,terminal_byte)
      File "splitter.py", line 4, in read_line
        line = ''.join(iter(lambda:f_object.read(1),terminal_byte))
    MemoryError
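One plausible reading of this traceback (an inference from the code, not something the original posts confirm): iter(callable, sentinel) stops only when the callable returns the sentinel. Once the file is exhausted, read(1) returns an empty string, which never equals the terminal byte, so join keeps collecting empty strings until memory runs out. The sketch below illustrates that behaviour in Python 3 with an in-memory io.BytesIO standing in for the input file; itertools.islice is used only to cap the otherwise endless iterator for the demonstration.

```python
import io
from itertools import islice

# Stand-in for the input file (assumption: records are terminated by \x01).
fh = io.BytesIO(b"ab\x01cd\x01")

# While data remains, the iterator stops as soon as read(1) returns the sentinel.
first = b"".join(iter(lambda: fh.read(1), b"\x01"))
second = b"".join(iter(lambda: fh.read(1), b"\x01"))

# At end of file read(1) returns b"" on every call, so the sentinel is never seen.
# Without the islice cap this iterator would never finish.
after_eof = list(islice(iter(lambda: fh.read(1), b"\x01"), 5))

print(first, second, after_eof)
```

If that is indeed what happens, reading in larger blocks and checking for an empty read avoids both the endless loop and the cost of one read call per byte.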
Question: splitting big file into smaller files
Instead of finding every single \x01, search for it only in the last chunk of each output file.
Either reset the file pointer to offset+1 of the last \x01 found and continue, or write up to that offset into the current chunk file and write the remaining part of the chunk into the next chunk file.
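The second option, carrying the remainder over into the next chunk file, could be sketched roughly as follows. The helper name split_file, its parameters, and the use of bytes.rfind to locate the last delimiter are assumptions made for illustration; output files come out at roughly split_size bytes and can be larger when a record straddles a read boundary.

```python
import os
import tempfile

def split_file(path, out_pattern, split_size, delimiter=b"\x01"):
    """Hypothetical helper: write pieces of `path` that each end on `delimiter`."""
    written = []
    remainder = b""  # bytes after the last delimiter, carried into the next piece
    with open(path, "rb") as fh_in:
        while True:
            block = fh_in.read(split_size)
            data = remainder + block
            if block:
                # Cut right after the last delimiter in the data held so far.
                cut = data.rfind(delimiter) + 1
                if cut == 0:
                    # No delimiter yet: keep accumulating (a record longer than split_size).
                    remainder = data
                    continue
            else:
                # End of input: flush whatever is left, delimiter or not.
                cut = len(data)
            if cut:
                name = out_pattern % len(written)
                with open(name, "wb") as fh_out:
                    fh_out.write(data[:cut])
                written.append(name)
            remainder = data[cut:]
            if not block:
                break
    return written

# Small demonstration on a throwaway file.
data = b"Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.bin")
    with open(src, "wb") as fh:
        fh.write(data)
    names = split_file(src, os.path.join(tmp, "out%d.txt"), split_size=30)
    parts = []
    for name in names:
        with open(name, "rb") as fh:
            parts.append(fh.read())
print(parts)
```

Only one block plus the carried remainder is held in memory at a time, so peak memory stays near split_size as long as individual records are not much larger than that.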
Note: Your chunk_size should be io.DEFAULT_BUFFER_SIZE or a multiple of that.
You gain no speedup if you raise the chunk_size too high.
Read this relevant SO QA: Default buffer size for a file
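As a small illustration of that note, a read size can be rounded up to a multiple of io.DEFAULT_BUFFER_SIZE. The helper name aligned_chunk_size is hypothetical, and the actual value of the constant depends on the Python version and platform.

```python
import io

def aligned_chunk_size(target_bytes, base=io.DEFAULT_BUFFER_SIZE):
    """Hypothetical helper: round target_bytes up to the nearest multiple of base."""
    multiples = max(1, -(-target_bytes // base))  # ceiling division, at least one buffer
    return multiples * base

chunk_size = aligned_chunk_size(1024 * 1000 * 128)
print(io.DEFAULT_BUFFER_SIZE, chunk_size)
```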
My example shows how to reset the file pointer:
    import io

    large_data = b"""Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01"""

    def split(chunk_size, split_size):
        with io.BytesIO(large_data) as fh_in:
            _size = 0

            # Used to verify chunked writes
            result_data = io.BytesIO()

            while True:
                chunk = fh_in.read(chunk_size)
                print('read({})'.format(bytearray(chunk)))
                if not chunk: break

                _size += chunk_size
                if _size >= split_size:
                    _size = 0
                    # Split on last 0x01
                    l = len(chunk)
                    print('\tsplit_on_last_\\x01({})\t{}'.format(l, bytearray(chunk)))

                    # Reverse iterate
                    for p in range(l-1, -1, -1):
                        c = chunk[p:p+1]
                        if ord(c) == ord('\x01'):
                            offset = l-(p+1)

                            # Condition if \x01 is the last byte in chunk
                            if offset == 0:
                                print('\toffset={} write({})\t\t{}'.format(offset, l - offset, bytearray(chunk)))
                                result_data.write(chunk)
                            else:
                                # Reset file pointer
                                fh_in.seek(fh_in.tell()-offset)
                                print('\toffset={} write({})\t\t{}'.format(offset, l-offset, bytearray(chunk[:-offset])))
                                result_data.write(chunk[:-offset])
                            break
                    else:
                        # No \x01 in this chunk: write it whole so no data is dropped
                        result_data.write(chunk)
                else:
                    print('\twrite({}) {}'.format(chunk_size, bytearray(chunk)))
                    result_data.write(chunk)

            print('INPUT :{}\nOUTPUT:{}'.format(large_data, result_data.getvalue()))

    if __name__ == '__main__':
        split(chunk_size=30, split_size=60)
Output:
    read(bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci'))
        write(30) bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci')
    read(bytearray(b'ng elitr, sed\x01labore et\x01dolore'))
        split_on_last_\x01(30) bytearray(b'ng elitr, sed\x01labore et\x01dolore')
        offset=6 write(24) bytearray(b'ng elitr, sed\x01labore et\x01')
    read(bytearray(b'dolores et ea rebum.\x01magna ali'))
        write(30) bytearray(b'dolores et ea rebum.\x01magna ali')
    read(bytearray(b'quyam erat,\x01'))
        split_on_last_\x01(12) bytearray(b'quyam erat,\x01')
        offset=0 write(12) bytearray(b'quyam erat,\x01')
    read(bytearray(b''))
    INPUT :b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01'
    OUTPUT:b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01'
Tested with Python: 3.4.2