What is the fastest way to save a large pandas DataFrame to S3?


Question

I am trying to figure out the fastest way to write a large pandas DataFrame to the S3 filesystem. I am currently trying two approaches:

1) Through gzip compression (BytesIO) and boto3

import gzip
from io import BytesIO, TextIOWrapper

import boto3

# write the DataFrame as gzip-compressed CSV into an in-memory buffer
gz_buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
    df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

# upload the buffer contents to S3 in a single put
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket, s3_path + name_zip)
s3_object.put(Body=gz_buffer.getvalue())

For a DataFrame of 7 million rows, this takes around 420 seconds to write to S3.

2) Through writing to a CSV file without compression (StringIO buffer)

from io import StringIO
import boto3

# write the DataFrame as uncompressed CSV into an in-memory text buffer
csv_buffer = StringIO()
df.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, s3_path + name_csv).put(Body=csv_buffer.getvalue())

This takes around 371 seconds.

The question is: is there any faster way to write a pandas DataFrame to S3?

Answer

Use multipart uploads to make the transfer to S3 faster. Compression makes the file smaller, so that helps too.

from io import BytesIO

import boto3

s3 = boto3.client('s3')

# write gzip-compressed CSV into an in-memory binary buffer
# (writing compressed output to a binary buffer needs a reasonably recent pandas, 1.2+)
csv_buffer = BytesIO()
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind so upload_fileobj reads from the start of the buffer

# multipart upload
# use boto3.s3.transfer.TransferConfig if you need to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)

The documentation for s3.upload_fileobj is here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj
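
The comment in the snippet above mentions boto3.s3.transfer.TransferConfig; below is a minimal sketch of what tuning the multipart transfer might look like. The part size and concurrency values are illustrative assumptions, not numbers from the answer, and df, bucket and key are assumed to be defined as in the snippet above.

from io import BytesIO

import boto3
from boto3.s3.transfer import TransferConfig

# illustrative tuning values, not recommendations from the answer
config = TransferConfig(
    multipart_threshold=32 * 1024 * 1024,  # switch to multipart above 32 MB
    multipart_chunksize=32 * 1024 * 1024,  # size of each uploaded part
    max_concurrency=8,                     # parts uploaded in parallel threads
)

s3 = boto3.client('s3')

csv_buffer = BytesIO()
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)

# same upload_fileobj call as above, with an explicit transfer configuration
s3.upload_fileobj(csv_buffer, bucket, key, Config=config)

Larger part sizes mean fewer requests, while higher concurrency uses more threads and memory; the right balance depends on the file size and available bandwidth.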
