Stream large string to S3 using boto3
Question
I am downloading files from S3, transforming the data inside them, and then creating a new file to upload to S3. The files I am downloading are less than 2 GB, but because I am enhancing the data, the file I upload is quite large (200 GB+).
Currently, you can imagine the code as something like:
files = list_files_in_s3()
new_file = open('new_file', 'w')
for file in files:
    file_data = fetch_object_from_s3(file)
    str_out = ''
    for data in file_data:
        str_out += transform_data(data)
    new_file.write(str_out)
s3.upload_file('new_file', 'bucket', 'key')
The problem with this is that 'new_file' is sometimes too big to fit on disk. Because of this, I want to use boto3's upload_fileobj to upload the data as a stream, so that I don't need to have a temp file on disk at all.
Can someone provide an example of this? The Python approach seems quite different from Java, which I am familiar with.
Answer
You can use the amt parameter of the read function, documented here: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.
Then use MultiPartUpload, documented here, to upload the file piece by piece: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#multipartupload
https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html
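The chunked-read pattern itself is easy to try locally. In this sketch, io.BytesIO stands in for botocore's StreamingBody (the stand-in's read takes the size positionally, while StreamingBody.read takes it as the amt keyword):

```python
import io

# io.BytesIO stands in for botocore's StreamingBody here; the real
# StreamingBody.read takes the chunk size as the keyword argument amt.
stream = io.BytesIO(b"abcdefghij" * 3)  # 30 bytes of fake object data
chunks = []
while True:
    chunk = stream.read(8)  # with a StreamingBody: body.read(amt=8)
    if not chunk:
        break
    chunks.append(chunk)
print([len(c) for c in chunks])  # [8, 8, 8, 6]
```

Each call returns up to the requested number of bytes and the final call before exhaustion may return a short chunk, which is why the loop checks for an empty result rather than counting iterations.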
You should have a rule that deletes incomplete multipart uploads, or else you may end up paying for incomplete data parts stored in S3.
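As a sketch of such a rule (the bucket name and the 7-day window are placeholders, not values from the original answer), it can be expressed as a lifecycle configuration and applied with the S3 client's put_bucket_lifecycle_configuration:

```python
# Lifecycle rule that aborts multipart uploads left incomplete for 7 days,
# so abandoned parts do not keep accruing storage charges.
lifecycle_config = {
    "Rules": [
        {
            "ID": "abort-incomplete-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

# Applying it requires boto3 and valid credentials, e.g.:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="your-bucket", LifecycleConfiguration=lifecycle_config
# )
print(lifecycle_config["Rules"][0]["AbortIncompleteMultipartUpload"])
```

The same rule can also be set once in the S3 console instead of from code.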
I copy-pasted something from my own script to do this. It shows how you can stream all the way from downloading to uploading, in case you have memory limitations to consider. You could also alter this to store the file locally before you upload.
You will have to use MultiPartUpload anyway, since S3 has limits on how large a file you can upload in one action: https://aws.amazon.com/s3/faqs/
"The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability."
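One more limit worth checking before picking a chunk size: S3 caps a multipart upload at 10,000 parts, so a fixed 10 MB part (as in the code below) cannot cover a 200 GB object. A quick back-of-the-envelope check:

```python
# S3 allows at most 10,000 parts per multipart upload, so the part size
# must be chosen with the final object size in mind.
MAX_PARTS = 10_000
GB = 1024**3
MB = 1024**2

def min_part_size(total_bytes, max_parts=MAX_PARTS):
    """Smallest part size (in bytes) that keeps the upload within max_parts."""
    return -(-total_bytes // max_parts)  # ceiling division

# A 200 GB object cannot use 10 MB parts: that would need 20,480 parts.
parts_needed = -(-200 * GB // (10 * MB))
print(parts_needed)               # 20480
print(min_part_size(200 * GB))    # 21474837 bytes, about 20.5 MB per part
```

So for a 200 GB result, raise the amt value in the sample below to at least roughly 21 MB (or scale it to your expected output size).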
Here is a code sample (I haven't tested this exact code):
import boto3

amt = 1024 * 1024 * 10  # read 10 MB at a time
session = boto3.Session(profile_name='yourprofile')
s3res = session.resource('s3')
source_s3file = "yourfile.file"
target_s3file = "yourfile.file"
source_s3obj = s3res.Object("your-bucket", source_s3file)
target_s3obj = s3res.Object("your-bucket", target_s3file)

# initiate the MultiPartUpload
mpu = target_s3obj.initiate_multipart_upload()
partNr = 0
parts = []
body = source_s3obj.get()["Body"]

while True:
    # every call to read returns the next chunk of data until the stream is empty
    chunk = body.read(amt=amt).decode("utf-8")  # this is where you use the amt parameter
    # note: a fixed byte boundary can split a multi-byte UTF-8 character;
    # decode like this only if your data is safe to split at arbitrary offsets
    if not chunk:
        # no more data; part numbers start at 1, so stop without uploading an empty part
        break
    # do something with the chunk, then upload it as the next part
    partNr += 1
    part = mpu.Part(partNr)
    response = part.upload(Body=chunk)
    parts.append({
        "PartNumber": partNr,
        "ETag": response["ETag"]
    })

# no more chunks, complete the upload
part_info = {"Parts": parts}
mpu_result = mpu.complete(MultipartUpload=part_info)