Running out of RAM when writing to a file line by line [Python]
Question
I have a data processing task on some large data. I run the script on EC2 using Python; it looks something like the following:
import json

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for line in f:
            results = some_computation(line)
            out.write(json.dumps(results))
            out.write('\n')
I loop over the data line by line and write the results to another file line by line.
After running it for a few hours, I can't log in to the server, and I have to restart the instance to continue.
$ ssh ubuntu@$IP_ADDRESS
ssh_exchange_identification: read: Connection reset by peer
It's likely the server is running out of RAM: while the script is writing to the file, its memory usage slowly creeps up. I am not sure why memory would be a problem when reading and writing line by line. I have plenty of disk space.
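One way to confirm that the Python process itself is the one leaking is to log its peak resident set size as the loop runs. A minimal sketch using the standard-library resource module (Unix only; the function name and logging interval are my own, not from the original post):

import resource

def log_peak_rss(tag=''):
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('%s peak RSS: %.1f MB' % (tag, rss_kb / 1024.0))

Calling log_peak_rss() every few thousand lines inside the loop makes it obvious whether the growth tracks the number of lines processed.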
I think the closest existing question to this issue is: Does the Python "open" function save its content in memory or in a temp file?
Answer
I was using spaCy to do some preprocessing of the text. It looks like using the tokenizer causes steady memory growth:
https://github.com/spacy-io/spaCy/issues/285
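For reference, a common workaround for this kind of growth (not part of the original answer, just a sketch) is to rebuild the spaCy pipeline every N lines, so the string store it accumulates while tokenizing is released along with the old object. The model name 'en' and the reload interval here are assumptions:

import json
import spacy  # assumes spaCy is installed

RELOAD_EVERY = 100000  # hypothetical interval; tune to your memory budget

def process(large_file, output_file):
    nlp = spacy.load('en')  # model name is an assumption
    with open(large_file) as f, open(output_file, 'w') as out:
        for i, line in enumerate(f, 1):
            tokens = [t.text for t in nlp(line)]
            out.write(json.dumps(tokens))
            out.write('\n')
            if i % RELOAD_EVERY == 0:
                # drop the old pipeline (and its grown string store)
                nlp = spacy.load('en')

Reloading is crude, but it keeps peak memory bounded when the growth lives inside the pipeline object itself.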