How can I increase my AWS s3 upload speed when using boto3?

Question

Hey, there were some similar questions, but none exactly like this, and a fair number of them are multiple years old and out of date.

I have written some code on my server that uploads jpeg photos into an s3 bucket using a key via the boto3 method upload_file. Initially this seemed great. It is a super simple solution for uploading files into s3.

The thing is, I have users. My users are sending their jpegs to my server via a phone app. While I concede that I could generate presigned upload URLs and send them to the phone app, that would require a considerable rewrite of our phone app and API.
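
For reference, the presigned-URL alternative being set aside here might look roughly like the following sketch. The bucket and key names are placeholders, not from the original question; the phone app would then HTTP PUT the jpeg body directly to the returned URL:

import boto3

s3client = boto3.client('s3')

# Generate a URL the phone app could PUT the jpeg to directly,
# bypassing the server entirely (valid for one hour here).
url = s3client.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-bucket', 'Key': 'photos/photo.jpg'},  # placeholder names
    ExpiresIn=3600,
)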

So I just want the phone app to send the photos to the server. I then want to send the photos from the server to s3. I implemented this but it is way too slow. I cannot ask my users to tolerate those slow uploads.

What can I do to speed this up?

I did some Google searching and found this: https://medium.com/@alejandro.millan.frias/optimizing-transfer-throughput-of-small-files-to-amazon-s3-or-anywhere-really-301dca4472a5

It suggests that the solution is to increase the number of TCP/IP connections: more TCP/IP connections mean faster uploads.

OK, great!

How do I do that? How do I increase the number of TCP/IP connections so I can upload a single jpeg into AWS s3 faster?

Please help.

Answer

Ironically, we've been using boto3 for years, as well as awscli, and we like them both.

But we've often wondered why awscli's aws s3 cp --recursive or aws s3 sync are often so much faster than trying to do a bunch of uploads via boto3, even with concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor (and don't even think about sharing the same s3.Bucket among your workers: it's warned against in the docs, and for good reason; nasty crashes will eventually ensue at the most inconvenient time).
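
For illustration, the kind of hand-rolled thread-pool upload being compared against might look like this sketch. The bucket name and key layout are assumptions, and filelist is the same list of local paths used in the answer below; note that it shares a single low-level client, which is thread-safe, rather than an s3.Bucket resource, which is not:

import os
import boto3
from concurrent.futures import ThreadPoolExecutor

s3client = boto3.client('s3')  # low-level clients are thread-safe; s3.Bucket resources are not

def upload_one(path):
    # 'my-bucket' and the key layout are placeholders for illustration
    s3client.upload_file(path, 'my-bucket', 'photos/' + os.path.basename(path))

# 10 workers matches the client's default urllib3 connection pool size
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(upload_one, filelist))  # list() forces completion and surfaces errors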

Finally, I bit the bullet and looked inside the "customization" code that awscli introduces on top of boto3.

Based on that little exploration, here is a way to speed up the upload of many files to S3 by using the concurrency already built into boto3.s3.transfer, not just for the possible multiparts of a single large file, but for a whole bunch of files of various sizes as well. That functionality is, as far as I know, not exposed through the higher-level APIs of boto3 described in the boto3 docs.

The following code:

  1. Uses boto3.s3.transfer to create a TransferManager, the very same one that is used by awscli's aws s3 sync, for example.

  2. Extends the max number of threads to 20.

  3. Augments the underlying urllib3 max pool connections capacity used by botocore to match (by default, it uses a maximum of 10 connections).

  4. Gives you an optional callback capability (demoed here with a tqdm progress bar, but of course you can have whatever callback you'd like).

  5. Is fast (over 100 MB/s, tested on an ec2 instance).

I put a complete example as a gist, including the generation of 500 random csv files for a total of about 360MB. But if you assume you already have a bunch of files in filelist below, for a total of total_size bytes:

import os
import boto3
import botocore.config
import boto3.s3.transfer as s3transfer
import tqdm

# Raise the urllib3 connection pool size to match the thread count below
botocore_config = botocore.config.Config(max_pool_connections=20)
s3client = boto3.client('s3', config=botocore_config)

transfer_config = s3transfer.TransferConfig(
    use_threads=True,
    max_concurrency=20,
)

bucket_name = '<your-bucket-name>'
s3junkdir = 'some/path/for/junk'

# filelist (local paths) and total_size (sum of their sizes, in bytes)
# are assumed to exist already, as described above
progress = tqdm.tqdm(
    desc='upload',
    total=total_size, unit='B', unit_scale=1,
    position=0,
    bar_format='{desc:<10}{percentage:3.0f}%|{bar:10}{r_bar}')

s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
for src in filelist:
    dst = os.path.join(s3junkdir, os.path.basename(src))
    s3t.upload(
        src, bucket_name, dst,
        subscribers=[
            s3transfer.ProgressCallbackInvoker(progress.update),
        ],
    )

s3t.shutdown()  # wait for all the upload tasks to finish
progress.close()
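
If you still need to build filelist and total_size, one way to do it from a local directory (the path here is a placeholder) is:

import os

srcdir = '/path/to/photos'  # placeholder local directory
filelist = [
    os.path.join(dirpath, name)
    for dirpath, _, names in os.walk(srcdir)
    for name in names
]
total_size = sum(os.path.getsize(p) for p in filelist)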
