Optimize S3 download for large number of tiny files


Problem Description

I currently use TransferManager to download all files in an S3 bucket, from a Lambda function.

import com.amazonaws.services.s3.transfer.MultipleFileDownload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

// Initialize
TransferManagerBuilder txBuilder = TransferManagerBuilder.standard();
// txBuilder.setExecutorFactory(() -> Executors.newFixedThreadPool(50));
TransferManager tx = txBuilder.build();
// createTempDirectory takes a name prefix, not a path; Lambda's default
// temp directory is already /tmp
final Path tmpDir = Files.createTempDirectory("s3_download");

// Download everything under bucketKey into the temp directory
MultipleFileDownload download = tx.downloadDirectory(bucketName,
                                                     bucketKey,
                                                     tmpDir.toFile());
download.waitForCompletion();

return Files.list(tmpDir.resolve(bucketKey)).collect(Collectors.toList());

It seems to take around 300 seconds to download 10,000 files (of size ~20KB each), giving me a transfer rate of about 666 KBps. Increasing the thread pool size doesn't seem to affect the transfer rate at all.

The S3 endpoint and the Lambda function are in the same AWS region and in the same AWS account.

How can I optimize the S3 download?

Answer

Dealing with a large amount of data always requires architecting your storage around the underlying systems.

If you need high throughput, you need to partition your S3 keys so that the bucket can accommodate a high number of requests. Distributed computing comes with its own requirements for achieving high performance, and this is one of them.
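
As a rough sketch of that partitioning idea (following the request-rate guidance linked below), you could prepend a short hash to each key so objects spread across many prefixes instead of sharing one hot prefix; the helper name and the MD5-based scheme are illustrative assumptions, not part of the original answer:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: prepend a two-hex-digit hash so objects spread
// across up to 256 key prefixes instead of one hot prefix.
static String partitionedKey(String originalKey) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5")
            .digest(originalKey.getBytes(StandardCharsets.UTF_8));
    String hashPrefix = String.format("%02x", digest[0] & 0xff);
    // e.g. "reports/file-0001.json" -> "<hh>/reports/file-0001.json"
    return hashPrefix + "/" + originalKey;
}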

Request rate considerations:

https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

Transfer acceleration:

https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
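
Once acceleration has been enabled on the bucket, the v1 SDK can opt into it on the client; a minimal sketch (the region is an assumption, and since the Lambda here already sits in the bucket's region, the benefit is worth measuring first):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

// Requires Transfer Acceleration to be enabled on the bucket first.
AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion("us-east-1")          // assumed region
        .withAccelerateModeEnabled(true)  // route via the accelerate endpoint
        .build();

TransferManager tx = TransferManagerBuilder.standard()
        .withS3Client(s3)
        .build();
// ... then use tx.downloadDirectory(...) as in the question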

How to improve throughput:

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
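
The main client-side knob that article points at is request parallelism; the question's commented-out executor factory is where to set it, though the asker reports that pool size alone did not move the needle (the 100 threads below is just an assumed starting point):

import java.util.concurrent.Executors;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

// For ~20KB objects, per-request latency dominates over bandwidth,
// so concurrency is the main lever available on the client.
TransferManager tx = TransferManagerBuilder.standard()
        .withExecutorFactory(() -> Executors.newFixedThreadPool(100))
        .build();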

Hope it helps.

EDIT1

I see that you are trying to download the files to ephemeral storage; you need to be aware of its storage limits. It is not meant for bulk processing.

https://docs.aws.amazon.com/lambda/latest/dg/limits.html
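
If the files only need to be read once, one way to sidestep the /tmp limit (512 MB on that limits page) is to skip the disk and stream each object straight into memory; a sketch assuming the v1 SDK and objects small enough to buffer:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.util.IOUtils;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
ListObjectsV2Request req = new ListObjectsV2Request()
        .withBucketName(bucketName)
        .withPrefix(bucketKey);
ListObjectsV2Result page;
do {
    page = s3.listObjectsV2(req);
    for (S3ObjectSummary summary : page.getObjectSummaries()) {
        try (S3Object obj = s3.getObject(bucketName, summary.getKey())) {
            byte[] content = IOUtils.toByteArray(obj.getObjectContent());
            // process `content` here instead of staging it in /tmp
        }
    }
    req.setContinuationToken(page.getNextContinuationToken());
} while (page.isTruncated());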
