Stream large string to S3 using boto3
Question
I am downloading files from S3, transforming the data inside them, and then creating a new file to upload to S3. The files I am downloading are less than 2 GB, but because I am enhancing the data, the file I upload is quite large (200 GB+).
Currently, you can imagine the code as something like:
files = list_files_in_s3()
new_file = open('new_file', 'w')
for file in files:
    file_data = fetch_object_from_s3(file)
    str_out = ''
    for data in file_data:
        str_out += transform_data(data)
    new_file.write(str_out)
s3.upload_file('new_file', 'bucket', 'key')
The problem with this is that 'new_file' is sometimes too big to fit on disk. Because of this, I want to use boto3's upload_fileobj to upload the data as a stream, so that I don't need to have a temp file on disk at all.
Can someone provide an example of this? The Python approach seems quite different from Java, which I am familiar with.
Answer
You can use the amt parameter of the read function, documented here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.
Then use MultiPartUpload, documented here, to upload the file piece by piece: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#multipartupload
https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
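The chunked-read pattern itself is easy to try locally. In this sketch, io.BytesIO stands in for botocore's StreamingBody (the stand-in's read takes the size positionally, while StreamingBody.read takes it as the amt keyword):

```python
import io

# io.BytesIO stands in for botocore's StreamingBody here; the real
# StreamingBody.read takes the chunk size as the keyword argument amt.
stream = io.BytesIO(b"abcdefghij" * 3)  # 30 bytes of fake object data
chunks = []
while True:
    chunk = stream.read(8)  # with a StreamingBody: body.read(amt=8)
    if not chunk:
        break
    chunks.append(chunk)
print([len(c) for c in chunks])  # [8, 8, 8, 6]
```

Each call returns up to the requested number of bytes and the final call before exhaustion may return a short chunk, which is why the loop checks for an empty result rather than counting iterations.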
You should have a rule that deletes incomplete multipart uploads, or else you may end up paying for incomplete data parts stored in S3.
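As a sketch of such a rule (the bucket name and the 7-day window are placeholders, not values from the original answer), it can be expressed as a lifecycle configuration and applied with the S3 client's put_bucket_lifecycle_configuration:

```python
# Lifecycle rule that aborts multipart uploads left incomplete for 7 days,
# so abandoned parts do not keep accruing storage charges.
lifecycle_config = {
    "Rules": [
        {
            "ID": "abort-incomplete-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# Applying it requires boto3 and valid credentials, e.g.:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="your-bucket", LifecycleConfiguration=lifecycle_config
# )
print(lifecycle_config["Rules"][0]["AbortIncompleteMultipartUpload"])
```

The same rule can also be set once in the S3 console instead of from code.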
I copy-pasted something from my own script to do this. It shows how you can stream all the way from downloading to uploading, in case you have memory limitations to consider. You could also alter this to store the file locally before you upload.
You will have to use MultiPartUpload anyway, since S3 has limits on how large a file you can upload in one action: https://aws.amazon.com/s3/faqs/
"The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability."
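One more limit worth checking before picking a chunk size: S3 caps a multipart upload at 10,000 parts, so a fixed 10 MB part (as in the code below) cannot cover a 200 GB object. A quick back-of-the-envelope check:

```python
# S3 allows at most 10,000 parts per multipart upload, so the part size
# must be chosen with the final object size in mind.
MAX_PARTS = 10_000
GB = 1024**3
MB = 1024**2

def min_part_size(total_bytes, max_parts=MAX_PARTS):
    """Smallest part size (in bytes) that keeps the upload within max_parts."""
    return -(-total_bytes // max_parts)  # ceiling division

# A 200 GB object cannot use 10 MB parts: that would need 20,480 parts.
parts_needed = -(-200 * GB // (10 * MB))
print(parts_needed)               # 20480
print(min_part_size(200 * GB))    # 21474837 bytes, about 20.5 MB per part
```

So for a 200 GB result, raise the amt value in the sample below to at least roughly 21 MB (or scale it to your expected output size).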
Here is a code sample (I haven't tested this exact code):
import boto3

amt = 1024 * 1024 * 10  # read 10 MB at a time
session = boto3.Session(profile_name='yourprofile')
s3res = session.resource('s3')
source_s3file = "yourfile.file"
target_s3file = "yourfile.file"
source_s3obj = s3res.Object("your-bucket", source_s3file)
target_s3obj = s3res.Object("your-bucket", target_s3file)

# initiate the MultiPartUpload
mpu = target_s3obj.initiate_multipart_upload()
partNr = 0
parts = []
body = source_s3obj.get()["Body"]

while True:
    # every call to read returns the next chunk of data until the stream is empty
    chunk = body.read(amt=amt).decode("utf-8")  # this is where you use the amt parameter
    # note: a fixed byte boundary can split a multi-byte UTF-8 character;
    # decode like this only if your data is safe to split at arbitrary offsets
    if not chunk:
        # no more data; part numbers start at 1, so stop without uploading an empty part
        break
    # do something with the chunk, then upload it as the next part
    partNr += 1
    part = mpu.Part(partNr)
    response = part.upload(Body=chunk)
    parts.append({
        "PartNumber": partNr,
        "ETag": response["ETag"]
    })

# no more chunks, complete the upload
part_info = {"Parts": parts}
mpu_result = mpu.complete(MultipartUpload=part_info)