How to import a text file on AWS S3 into pandas without writing to disk
Problem description
I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas but cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.
import io
import boto3
import os
import pandas as pd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
The error is:
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: initial_value must be str or None, not StreamingBody
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: 'StreamingBody' does not support the buffer interface
UPDATE - Using the following worked
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
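To illustrate why the update works, here is a minimal sketch that needs no AWS access. FakeStreamingBody is a hypothetical stand-in for the object in response["Body"] and only mimics its read() method; body_to_buffer is likewise a name made up for this example. Both io.StringIO and io.BytesIO reject the body object itself, while the bytes returned by read() are accepted.

```python
import io


class FakeStreamingBody:
    """Hypothetical stand-in for response["Body"]; only read() is mimicked."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


def body_to_buffer(body) -> io.BytesIO:
    """Read the whole body into memory and wrap it in a seekable buffer."""
    return io.BytesIO(body.read())


body = FakeStreamingBody(b"col_a\tcol_b\n1\t2\n")

# Wrapping the body object itself fails: it is neither str nor bytes-like.
for wrapper in (io.StringIO, io.BytesIO):
    try:
        wrapper(body)
    except TypeError as exc:
        print(f"{wrapper.__name__}: {exc}")

# Reading first yields bytes, which io.BytesIO accepts.
buffer = body_to_buffer(body)
first_line = buffer.readline()
buffer.seek(0)  # rewind so a later reader starts from the beginning
```

The rewound buffer is a regular file-like object, so it can be passed to pd.read_csv with the same header, delimiter and low_memory arguments as in the update above.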
Solution
pandas uses boto for read_csv, so you should be able to:
import boto
import pandas as pd
data = pd.read_csv('s3://bucket....csv')
If you need boto3 because you are on Python 3.4+, you can:
import boto3
import io
import pandas as pd
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
# read() returns the object's contents as bytes; BytesIO makes them file-like
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
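The boto3 approach can be wrapped in a small helper so the read_csv options from the question are passed through. This is only a sketch: read_s3_table, FakeS3Client and FakeBody are hypothetical names introduced here, and the fake client exists solely so the example runs without credentials or network access. With a real client the function is assumed to work the same way, since it relies only on get_object and Body.read() as shown above.

```python
import io

import pandas as pd


def read_s3_table(client, bucket: str, key: str, **read_csv_kwargs) -> pd.DataFrame:
    """Fetch an object through a boto3-style client and parse it in memory."""
    obj = client.get_object(Bucket=bucket, Key=key)
    data = obj["Body"].read()  # bytes; nothing is written to disk
    return pd.read_csv(io.BytesIO(data), **read_csv_kwargs)


class FakeBody:
    """Hypothetical stand-in for the streaming body; only read() is mimicked."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


class FakeS3Client:
    """Hypothetical test double exposing only get_object(Bucket=..., Key=...)."""

    def __init__(self, objects: dict):
        self._objects = objects

    def get_object(self, Bucket: str, Key: str) -> dict:
        return {"Body": FakeBody(self._objects[(Bucket, Key)])}


client = FakeS3Client({("my_bucket", "filename.txt"): b"name\tvalue\na\t1\nb\t2\n"})
df = read_s3_table(client, "my_bucket", "filename.txt", delimiter="\t")
```

For the file in the question, the call would presumably look like read_s3_table(boto3.client('s3'), "my_bucket", "filename.txt", header=14, delimiter="\t", low_memory=False).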