Importing a Large Zipped JSON File from Amazon S3 into AWS RDS-PostgreSQL Using Python


Problem Description

I'm trying to import a large zipped JSON file from Amazon S3 into AWS RDS-PostgreSQL using Python, but these errors occurred:


Traceback (most recent call last):
  File "my_code.py", line 64, in <module>
    file_content = f.read().decode('utf-8').splitlines(True)
  File "/usr/lib64/python3.6/zipfile.py", line 835, in read
    buf += self._read1(self.MAX_N)
  File "/usr/lib64/python3.6/zipfile.py", line 925, in _read1
    data = self._decompressor.decompress(data, n)
MemoryError

# my_code.py

import sys
import boto3
import psycopg2
import zipfile
import io
import json
import config

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
connection = psycopg2.connect(host=<host>, dbname=<dbname>, user=<user>, password=<password>)
cursor = connection.cursor()

bucket = sys.argv[1]
key = sys.argv[2]
obj = s3.get_object(Bucket=bucket, Key=key)


def insert_query():
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))


if key.endswith('.zip'):
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    file_content = f.read().decode('utf-8').splitlines(True)
                    for row in file_content:
                        data = json.loads(row)
                        insert_query()
if key.endswith('.json'):
    file_content = obj['Body'].read().decode('utf-8').splitlines(True)
    for row in file_content:
        data = json.loads(row)
        insert_query()

connection.commit()
connection.close()

Are there any solutions to these problems? Any help would do, thank you so much!

Answer

The problem is that you read the entire file into memory at once, which can cause you to run out of memory if the file is indeed too large.

You should read the file one line at a time, and since each line in the file is apparently a JSON string, you can process each line directly in the loop:

with z.open(filename) as f:
    for line in f:
        insert_query(json.loads(line.decode('utf-8')))

Your insert_query function should accept data as a parameter, by the way:

def insert_query(data):
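
For reference, a minimal sketch of what the corrected flow might look like once both changes are applied. It reuses the imports, cursor, obj, and key set up in the question's code, and only shows the .zip branch, where the MemoryError was raised:

def insert_query(data):
    # The JSON document is now passed in as a parameter instead of being
    # read from a global variable.
    query = """
        INSERT INTO data_table
        SELECT
            (src.test->>'url')::varchar, (src.test->>'id')::bigint,
            (src.test->>'external_id')::bigint, (src.test->>'via')::jsonb
        FROM (SELECT CAST(%s AS JSONB) AS test) src
    """
    cursor.execute(query, (json.dumps(data),))


if key.endswith('.zip'):
    # The archive itself is still downloaded into memory, as in the
    # original code; only the decompressed contents are streamed.
    zip_files = obj['Body'].read()
    with io.BytesIO(zip_files) as zf:
        zf.seek(0)
        with zipfile.ZipFile(zf, mode='r') as z:
            for filename in z.namelist():
                with z.open(filename) as f:
                    # Iterate over the archived file line by line; the zip
                    # member is decompressed incrementally, so only one
                    # line is decoded and held in memory at a time.
                    for line in f:
                        insert_query(json.loads(line.decode('utf-8')))

connection.commit()
connection.close()

Because z.open() returns a file-like object that decompresses on the fly, iterating over it never materializes the whole decoded file, which is what triggered the MemoryError in the original f.read() call.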
