How to upload small files to Amazon S3 efficiently in Python


Question

Recently, I needed to implement a program to upload files residing on Amazon EC2 to S3 in Python as quickly as possible. Each file is about 30 KB.

I have tried several solutions, using multithreading, multiprocessing, and coroutines. The following are my performance test results on Amazon EC2.

3600 (number of files) * 30 KB (file size) ≈ 105 MB (total) --->

       5.5s [ 4 processes + 100 coroutines ]
       10s  [ 200 coroutines ]
       14s  [ 10 threads ]

The code is shown below.
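The snippets rely on two helpers that the post does not include, connect_to_s3_sevice and put, plus the constants DATA_DIR, NTHREAD and NPROCESS. A minimal sketch of what they might look like with boto 2 is given here for context only; the bucket name, key layout, and DATA_DIR value are assumptions, not part of the original question.

import os

import boto

BUCKET_NAME = 'my-bucket'   # assumption: replace with the real bucket name
DATA_DIR = 'data'           # assumption: local directory holding the 30 KB files
NTHREAD = 10                # matches the 10-thread run reported above
NPROCESS = 4                # matches the 4-process run reported above

def connect_to_s3_sevice():
    # boto picks up credentials from the environment or ~/.boto.
    return boto.connect_s3().get_bucket(BUCKET_NAME)

def put(client, path):
    # Upload a single file, using its basename as the S3 key (assumed layout).
    key = client.new_key(os.path.basename(path))
    key.set_contents_from_filename(path)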

For multithreading:

import os
import threading

def mput(i, client, files):
    # Each thread uploads the subset of files whose hash maps to its index.
    for f in files:
        if hash(f) % NTHREAD == i:
            put(client, os.path.join(DATA_DIR, f))


def test_multithreading():
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    ths = [threading.Thread(target=mput, args=(i, client, files)) for i in range(NTHREAD)]
    for th in ths:
        th.daemon = True
        th.start()
    for th in ths:
        th.join()

For coroutines (eventlet):

import functools
import os
import sys

import eventlet

client = connect_to_s3_sevice()
pool = eventlet.GreenPool(int(sys.argv[2]))  # pool size taken from the command line

xput = functools.partial(put, client)
files = os.listdir(DATA_DIR)
for f in files:
    pool.spawn_n(xput, os.path.join(DATA_DIR, f))
pool.waitall()

For multiprocessing:

import functools
import multiprocessing
import os

import eventlet


def pproc(i):
    # Each process opens its own connection and runs its own coroutine pool.
    client = connect_to_s3_sevice()
    files = os.listdir(DATA_DIR)
    pool = eventlet.GreenPool(100)

    xput = functools.partial(put, client)
    for f in files:
        if hash(f) % NPROCESS == i:
            pool.spawn_n(xput, os.path.join(DATA_DIR, f))
    pool.waitall()


def test_multiproc():
    procs = [multiprocessing.Process(target=pproc, args=(i, )) for i in range(NPROCESS)]
    for p in procs:
        p.daemon = True
        p.start()
    for p in procs:
        p.join()

The machine's configuration is Ubuntu 14.04, 2 CPUs (2.50 GHz), and 4 GB of memory.

The highest speed reached is about 19 MB/s (105 / 5.5). Overall, it is still too slow. Is there any way to speed it up? Could Stackless Python do it faster?

Answer

Sample parallel upload times to Amazon S3 using the Python boto SDK are available here:

Rather than writing the code yourself, you might also consider calling out to the AWS Command Line Interface (CLI), which can do uploads in parallel. It is also written in Python and uses boto.
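As a rough sketch only (the local directory and bucket URI are placeholders, not from the answer), the CLI can be driven from Python with subprocess; aws s3 sync uploads a whole directory and parallelizes the individual transfers itself:

import subprocess

# Assumption: the AWS CLI is installed and credentials are already configured.
# 'data' and 's3://my-bucket/data' are placeholder paths.
subprocess.check_call([
    'aws', 's3', 'sync',
    'data',                  # local directory of small files
    's3://my-bucket/data',   # destination bucket/prefix
])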
