Running out of RAM when writing to a file line by line [Python]


Problem Description

I have a data processing task on some large data. I run the script on EC2 using Python; it looks something like the following:

import json

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for line in f:
            # some_computation is the per-line processing step
            results = some_computation(line)
            out.write(json.dumps(results))
            out.write('\n')

I loop over the data line by line and write the results to another file line by line.

After running it for a few hours, I can't log in to the server. I would have to restart the instance to continue.

$ ssh ubuntu@$IP_ADDRESS
ssh_exchange_identification: read: Connection reset by peer

It's likely the server is running out of RAM. When writing to the file, RAM slowly creeps up. I am not sure why memory would be a problem when reading and writing line by line.

I have plenty of hard disk space.
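
Since the read/write loop itself holds no references across iterations, a likely culprit is state accumulating inside some_computation. A minimal diagnostic sketch (not part of the original question) using the standard-library tracemalloc module to print the top allocation sites periodically; the snapshot interval is arbitrary:

import json
import tracemalloc

tracemalloc.start()

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for i, line in enumerate(f):
            results = some_computation(line)
            out.write(json.dumps(results))
            out.write('\n')
            if i % 100000 == 0:
                # Show the five call sites holding the most memory so far
                snapshot = tracemalloc.take_snapshot()
                for stat in snapshot.statistics('lineno')[:5]:
                    print(stat)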

I think the closest existing question is this one: Does the Python "open" function save its content in memory or in a temp file?

Recommended Answer

I was using SpaCy to do some preprocessing of the text. It looks like using the tokenizer causes steady memory growth.

https://github.com/spacy-io/spaCy/issues/285
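
A common workaround for this kind of unbounded tokenizer/vocab growth is to periodically rebuild the pipeline so the old one, along with the strings it has cached, can be garbage-collected. A minimal sketch, assuming a spaCy 2.x+ install with the en_core_web_sm model; the reload interval and the per-line computation are placeholders:

import json
import spacy

RELOAD_EVERY = 100000  # placeholder interval; tune to your memory budget

nlp = spacy.load('en_core_web_sm')

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for i, line in enumerate(f):
            doc = nlp(line)
            results = [token.text for token in doc]  # placeholder computation
            out.write(json.dumps(results))
            out.write('\n')
            if i > 0 and i % RELOAD_EVERY == 0:
                # Replace the pipeline so the old one (and its accumulated
                # string store) can be freed by the garbage collector
                nlp = spacy.load('en_core_web_sm')

If throughput matters, nlp.pipe() processes an iterable of texts in batches and is usually faster than calling nlp() one line at a time.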

