s3fs gzip compression on pandas dataframe

Question

I'm trying to write a dataframe as a CSV file on S3 using the s3fs library and pandas. Despite what the documentation suggests, I'm afraid the gzip compression parameter isn't working with s3fs.

def DfTos3Csv(df, file):
    with fs.open(file, 'wb') as f:
        df.to_csv(f, compression='gzip', index=False)

This code saves the dataframe as a new object in S3, but as plain CSV rather than in gzip format. On the other hand, the read functionality works fine with this compression parameter.

def s3CsvToDf(file):
    with fs.open(file) as f:
        df = pd.read_csv(f, compression='gzip')
    return df

Any suggestions or alternatives for the write issue? Thank you in advance!

Answer

The compression parameter of to_csv() does not work when writing to a file-like stream. You have to do the gzip compression and the upload separately.

import gzip
import boto3
from io import BytesIO, TextIOWrapper

# Compress the CSV into an in-memory buffer first
buffer = BytesIO()

with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)

# Then upload the gzipped bytes to S3 with boto3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('bucket_name', 'key')
s3_object.put(Body=buffer.getvalue())
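
If you would rather keep using s3fs for the upload, a similar approach is to wrap the s3fs file object in gzip.GzipFile yourself. The following is a minimal sketch, assuming fs is an s3fs.S3FileSystem instance as in the question; the helper name and the explicit flush are illustrative, not part of the original answer.

import gzip
from io import TextIOWrapper

import s3fs

fs = s3fs.S3FileSystem()

def df_to_s3_csv_gz(df, path):
    # Open the S3 object for binary writing, layer gzip on top,
    # and give pandas a text interface to write the CSV into.
    with fs.open(path, 'wb') as raw:
        with gzip.GzipFile(mode='wb', fileobj=raw) as gz:
            text = TextIOWrapper(gz, 'utf-8')
            df.to_csv(text, index=False)
            text.flush()  # make sure buffered text reaches the gzip stream before it closes

An object written this way, e.g. with df_to_s3_csv_gz(df, 'bucket_name/key.csv.gz'), should then be readable by the s3CsvToDf() function from the question using compression='gzip'.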
