Stream large string to S3 using boto3


Problem description

I am downloading files from S3, transforming the data inside them, and then creating a new file to upload to S3. The files I am downloading are less than 2 GB, but because I am enhancing the data, it is quite large (200 GB+) by the time I upload it.

Currently, you can imagine the code looks something like this:

# everything is written to a local temp file, then uploaded in one go
files = list_files_in_s3()
new_file = open('new_file', 'w')
for file in files:
    file_data = fetch_object_from_s3(file)
    str_out = ''
    for data in file_data:
        str_out += transform_data(data)
    new_file.write(str_out)
new_file.close()
s3.upload_file('new_file', 'bucket', 'key')

The problem with this is that 'new_file' is sometimes too big to fit on disk. Because of this, I want to use boto3's upload_fileobj to upload the data in streaming form so that I don't need to have the temp file on disk at all.

Can someone help provide an example of this? The Python approach seems quite different from Java, which I am familiar with.

Recommended answer

You can use the amt parameter of the read function, documented here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html

And then use MultiPartUpload, documented here, to upload the file piece by piece: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#multipartupload

https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
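
Put together, the shape of that amt-based read loop is roughly the following (a minimal sketch, not from the original answer; the bucket and key names are placeholders):

import boto3

s3 = boto3.resource('s3')
# get() returns a response whose "Body" is a botocore StreamingBody
body = s3.Object('your-bucket', 'yourfile.file').get()['Body']

chunk = body.read(amt=5 * 1024 * 1024)          # read at most 5 MB
while chunk:
    # process the chunk here
    chunk = body.read(amt=5 * 1024 * 1024)      # an empty result means the stream is exhausted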

You should have a rule that deletes incomplete multipart uploads, or else you may end up paying for incomplete data-parts stored in S3.
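
Such a rule can be created in the S3 console, or with a small boto3 call like the one below (a sketch added for illustration; the bucket name and the 7-day window are assumptions, not part of the original answer):

import boto3

s3_client = boto3.client('s3')

# abort (and clean up) any multipart upload still incomplete 7 days after it was started
s3_client.put_bucket_lifecycle_configuration(
    Bucket='your-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'abort-incomplete-multipart-uploads',
                'Filter': {'Prefix': ''},  # apply to the whole bucket
                'Status': 'Enabled',
                'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7},
            }
        ]
    },
)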

I copy-pasted something from my own script to do this. It shows how you can stream all the way from downloading to uploading, in case you have memory limitations to consider. You could also alter it to store the file locally before you upload.

You will have to use MultiPartUpload anyway, since S3 has limits on how large a file you can upload in one action: https://aws.amazon.com/s3/faqs/

"可以在单个PUT中上载的最大对象是5 GB.对于大于100兆字节的对象,客户应考虑使用分段上传功能."

"The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability."

This is a code sample (I haven't tested this code exactly as it is here):

import boto3

amt = 1024*1024*10  # read 10 MB at a time (S3 parts must be at least 5 MB, except the last one)
session = boto3.Session(profile_name='yourprofile')
s3res = session.resource('s3')
source_s3file = "yourfile.file"
target_s3file = "yourfile.file"
source_s3obj = s3res.Object("your-bucket", source_s3file)
target_s3obj = s3res.Object("your-bucket", target_s3file)

# initiate MultiPartUpload
mpu = target_s3obj.initiate_multipart_upload()
partNr = 0
parts = []

# the Body of a get() response is a streaming object; every call to its read-function
# returns the next chunk of data until the stream is empty
body = source_s3obj.get()["Body"]

while True:
    chunk = body.read(amt=amt).decode("utf-8")  # this is where you use the amt-parameter
    # (note: decode assumes the chunk boundary does not split a multi-byte character)
    if not chunk:
        break  # no more data left to read
    # do something with the chunk, then upload it to S3 as the next part
    partNr += 1
    part = mpu.Part(partNr)
    response = part.upload(Body=chunk)
    parts.append({
        "PartNumber": partNr,
        "ETag": response["ETag"]
    })

# no more chunks, complete the upload
part_info = {"Parts": parts}
mpu_result = mpu.complete(MultipartUpload=part_info)
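
One extra safeguard worth considering (my addition, not part of the original answer): if the transform or a part upload fails partway through, the multipart upload stays open and its parts keep costing storage until the lifecycle rule above cleans them up, so it can be worth aborting it explicitly:

try:
    # ... read, transform and upload the parts as shown above ...
    mpu_result = mpu.complete(MultipartUpload=part_info)
except Exception:
    mpu.abort()  # discard any parts uploaded so far instead of leaving them stranded
    raise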
