How do I read a gzipped parquet file from S3 into Python using Boto3?


Problem description


I have a file called data.parquet.gzip on my S3 bucket. I can't figure out what's the problem in reading it. Normally I've worked with StringIO but I don't know how to fix it. I want to import it from S3 into my Python jupyter notebook session using pandas and boto3.

Answer


The solution is actually quite straightforward.

import boto3              # S3 client for reads/writes
import pandas as pd       # Reading parquet into a DataFrame
from io import BytesIO    # Wrapping raw bytes in a file-like object
import pyarrow            # Parquet engine used by pandas under the hood

# Set up your S3 client.
# Ideally your Access Key and Secret Access Key are stored in a credentials
# file already, so you don't have to pass these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)

# Fetch the object from S3
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)

# Read the streaming body into raw bytes
parquet_bytes = s3_response_object['Body'].read()

# Wrap the bytes in BytesIO and parse them into a DataFrame
df = pd.read_parquet(BytesIO(parquet_bytes))

