Fastest way to sync two Amazon S3 buckets


Question


I have an S3 bucket with around 4 million files taking up about 500 GB in total. I need to sync the files to a new bucket (actually, changing the name of the bucket would suffice, but as that is not possible I need to create a new bucket, move the files there, and remove the old one).


I'm using the AWS CLI's s3 sync command and it does the job, but it takes a lot of time. I would like to reduce the time so that downtime for the dependent system is minimal.


I tried running the sync both from my local machine and from an EC2 c4.xlarge instance, and there isn't much difference in the time taken.


I have noticed that the time taken can be somewhat reduced when I split the job into multiple batches using the --exclude and --include options and run them in parallel from separate terminal windows, e.g.:

aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "1?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "2?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "3?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "*" --include "4?/*" 
aws s3 sync s3://source-bucket s3://destination-bucket --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*"
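If you'd rather not juggle terminal windows, the batches above can be launched from a single script using shell job control. This is a minimal sketch assuming the same bucket names and "1?/*"-style prefix patterns as in the question; adjust them to your own key layout:

```shell
#!/bin/sh
# Sketch: run each per-prefix batch as a background job, then wait for all.
# Bucket names and the prefix patterns are taken from the question above.
parallel_sync() {
    src="$1"
    dst="$2"
    for pattern in "1?/*" "2?/*" "3?/*" "4?/*"; do
        aws s3 sync "$src" "$dst" --exclude "*" --include "$pattern" &
    done
    # Catch-all batch for keys not covered by any of the patterns above.
    aws s3 sync "$src" "$dst" \
        --exclude "1?/*" --exclude "2?/*" --exclude "3?/*" --exclude "4?/*" &
    wait    # block until every background sync has finished
}

# Usage:
#   parallel_sync "s3://source-bucket" "s3://destination-bucket"
```

Each batch still pays the cost of listing the keys it excludes, so the speedup comes mainly from overlapping the per-object copy requests, not from reducing the listing work.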


Is there anything else I can do to speed up the sync even more? Is another type of EC2 instance more suitable for the job? Is splitting the job into multiple batches a good idea, and is there an 'optimal' number of sync processes that can run in parallel on the same bucket?

Update


I'm leaning towards the strategy of syncing the buckets before taking the system down, doing the migration, and then syncing the buckets again to copy only the small number of files that changed in the meantime. However, running the same sync command even on buckets with no differences takes a lot of time.
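That two-phase plan can be sketched as a short script. Here, stop_system and start_system are hypothetical placeholders for whatever actually takes the dependent system offline and brings it back, and the bucket names are assumptions:

```shell
#!/bin/sh
# Sketch of the two-phase cutover: a long bulk sync while the system is
# still live, then a short catch-up sync inside the downtime window.
# stop_system and start_system are hypothetical placeholders.
migrate() {
    src="$1"
    dst="$2"
    aws s3 sync "$src" "$dst"   # phase 1: bulk copy, system still up
    stop_system                 # downtime window starts here
    aws s3 sync "$src" "$dst"   # phase 2: only keys changed since phase 1
    # ...repoint the application at the new bucket before restarting...
    start_system
}

# Usage:
#   migrate "s3://source-bucket" "s3://destination-bucket"
```

Note that phase 2 still has to list every key in both buckets to decide what changed, which is why a sync over identical buckets is not free.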

Answer


You can use EMR and S3DistCp. I had to sync 153 TB between two buckets and it took about 9 days. Also, make sure the buckets are in the same region, because otherwise you get hit with data transfer costs as well.

aws emr add-steps --cluster-id <value> --steps Name="Command Runner",Jar="command-runner.jar",Args=["s3-dist-cp","--s3Endpoint","s3.amazonaws.com","--src","s3://BUCKETNAME","--dest","s3://BUCKETNAME"]

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html

