MemoryError when Using the read() Method in Reading a Large Size of JSON file from Amazon S3


Problem Description

I'm trying to import a large JSON file from Amazon S3 into AWS RDS-PostgreSQL using Python, but these errors occurred:


Traceback (most recent call last):
  File "my_code.py", line 67, in <module>
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
  File "/home/user/asd-to-qwe/fgh-to-hjk/env/local/lib/python3.6/site-packages/botocore/response.py", line 76, in read
    chunk = self._raw_stream.read(amt)
  File "/home/user/asd-to-qwe/fgh-to-hjk/env/local/lib/python3.6/site-packages/botocore/vendored/requests/packages/urllib3/response.py", line 239, in read
    data = self._fp.read()
  File "/usr/lib64/python3.6/http/client.py", line 462, in read
    s = self._safe_read(self.length)
  File "/usr/lib64/python3.6/http/client.py", line 617, in _safe_read
    return b"".join(s)
MemoryError

# my_code.py

import sys
import boto3
import psycopg2
import zipfile
import io
import json

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
connection = psycopg2.connect(host=<host>, dbname=<dbname>, user=<user>, password=<password>)
cursor = connection.cursor()

bucket = sys.argv[1]
key = sys.argv[2]
obj = s3.get_object(Bucket=bucket, Key=key)

def insert_query(data):
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))


if key.endswith('.zip'):
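    # NOTE: .read() pulls the entire zip archive into memory as one bytes object.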
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    for line in f:
                        insert_query(json.loads(line.decode('utf-8')))
if key.endswith('.json'):
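    # NOTE: .read() loads the whole JSON file into memory before splitting it into lines;
    # this is the statement where the MemoryError in the traceback above is raised.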
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
    for line in file_content:
        insert_query(json.loads(line))


connection.commit()
connection.close()

Are there any solutions to these problems? Any help would do, thank you so much!

Recommended Answer

A significant savings can be had by avoiding slurping your whole input file into memory as a list of lines.

Specifically, these lines are terrible on memory usage, in that they involve a peak memory usage of a bytes object the size of your whole file, plus a list of lines with the complete contents of the file as well:

file_content = obj['Body'].read().decode('utf-8').splitlines(True)
for line in file_content:

For a 1 GB ASCII text file with 5 million lines, on 64 bit Python 3.3+, that's a peak memory requirement of roughly 2.3 GB for just the bytes object, the list, and the individual strs in the list. A program that needs 2.3x as much RAM as the size of the files it processes won't scale to large files.
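
As a rough sanity check on that 2.3 GB figure, here is a small back-of-envelope sketch; the per-object overheads it uses (the empty-str header size from sys.getsizeof and an 8-byte pointer per list slot) are 64-bit CPython implementation details, so treat the result as approximate:

import sys

# Back-of-envelope peak-memory estimate for a 1 GB ASCII file split into 5 million lines.
file_bytes = 1_000_000_000            # the bytes object returned by obj['Body'].read()
n_lines = 5_000_000

str_header = sys.getsizeof("")        # per-str object overhead (~49 bytes on 64-bit CPython)
list_slots = 8 * n_lines              # one 8-byte pointer per element of the list of lines

peak = (file_bytes                    # the undecoded bytes object
        + file_bytes                  # the character data held inside all the str lines
        + str_header * n_lines        # per-str object headers
        + list_slots)                 # the list built by splitlines(True)

print(f"approximate peak: {peak / 1e9:.2f} GB")   # roughly 2.3 GB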

To fix, change those original lines to:

file_content = io.TextIOWrapper(obj['Body'], encoding='utf-8')
for line in file_content:

Given that obj['Body'] appears to be usable for lazy streaming, this should remove both copies of the complete file data from memory. Using TextIOWrapper means obj['Body'] is lazily read and decoded in chunks (a few KB at a time), and the lines are iterated lazily as well; this reduces memory demands to a small, largely fixed amount (the peak memory cost would depend on the length of the longest line), regardless of file size.
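
Applied to the .json branch of the script in the question, that change might look like the sketch below. It assumes obj, key and insert_query are the same objects defined earlier in the script; if TextIOWrapper refuses the streaming body, see the update that follows.

import io
import json

# obj = s3.get_object(Bucket=bucket, Key=key), as in the original script
if key.endswith('.json'):
    # Wrap the streaming body so it is read and decoded lazily, a small buffer
    # at a time, instead of materialising the whole file with .read().
    file_content = io.TextIOWrapper(obj['Body'], encoding='utf-8')
    for line in file_content:
        insert_query(json.loads(line))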

Update:

It looks like StreamingBody doesn't implement the io.BufferedIOBase ABC. It does have its own documented API, though, that can be used for a similar purpose. If you can't make the TextIOWrapper do the work for you (it's much more efficient and simpler if it can be made to work), an alternative would be to do:

file_content = (line.decode('utf-8') for line in obj['Body'].iter_lines())
for line in file_content:

Unlike using TextIOWrapper, it doesn't benefit from bulk decoding of blocks (each line is decoded individually), but otherwise it should still achieve the same benefits in terms of reduced memory usage.
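
Wired into the same .json branch, that fallback could be written roughly as follows; iter_lines() yields each line of the object as bytes, so the decode happens one line at a time:

if key.endswith('.json'):
    # Stream the object line by line; each line arrives as bytes and is decoded
    # and inserted individually, so memory use stays bounded by the internal
    # chunk size plus the length of the longest line.
    for raw_line in obj['Body'].iter_lines():
        insert_query(json.loads(raw_line.decode('utf-8')))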
