Fastest way to download 3 million objects from an S3 bucket


Question

I've tried using Python + boto + multiprocessing, S3cmd and J3tset, but I'm struggling with all of them.

Any suggestions, perhaps a ready-made script you've been using, or another way I don't know of?

eventlet + boto is a worthwhile solution, as mentioned below. I found a good eventlet reference article here: http://web.archive.org/web/20110520140439/http://teddziuba.com/2010/02/eventlet-asynchronous-io-for-g.html
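
The pattern that article describes boils down to something like the following minimal sketch (Python 2 era, to match the boto-classic script below; the URLs and names here are placeholders of mine, not code from the article or the question): monkey patch the standard library first, then fan blocking calls out to a GreenPool.

import eventlet
eventlet.monkey_patch(all=True)  # patch sockets etc. before any other imports

import urllib2  # now backed by eventlet's cooperative sockets

# Placeholder URLs -- purely illustrative.
urls = ["http://example.com/a", "http://example.com/b"]

def fetch(url):
    # A blocking read; with the monkey patch in place it yields to
    # other green threads instead of stalling the whole process.
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool(size=10)
for body in pool.imap(fetch, urls):
    print len(body)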

I've added the Python script that I'm using right now below.

Answer

Okay, I figured out a solution based on @Matt Billenstein's hint. It uses the eventlet library. The first step is the most important here (monkey patching the standard IO libraries).

Run this script in the background with nohup and you're all set.

import eventlet
eventlet.monkey_patch(all=True)  # must run before boto is imported

from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

import logging

logging.basicConfig(filename="s3_download.log", level=logging.INFO)


def download_file(key_name):
    # It's important to fetch each key over a fresh connection:
    # a boto connection can't be shared safely between green threads.
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    try:
        key.get_contents_to_filename(key.name)
    except Exception:
        logging.info(key.name + ": FAILED")


if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = eventlet.GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket_list:  # iterate the prefixed listing, not bucket.list()
        pool.spawn_n(download_file, key.key)
    pool.waitall()
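
To launch it detached, something like nohup python s3_download.py & (the script filename is just an assumed name here) keeps it running after you log out. Failures are recorded in s3_download.log, so you can grep for FAILED afterwards and retry just those keys.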
