Reading multiple CSV files from an S3 bucket with boto3
Problem description
I need to read multiple CSV files from an S3 bucket with boto3 in Python and then combine them into a single pandas DataFrame.
I am able to read a single file with the following Python script:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
# Iterate over every object in the bucket and read its contents as bytes
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
The following is my path:
files/splittedfiles/Code-345678
In Code-345678 I have multiple CSV files, which I have to read and combine into a single pandas DataFrame.
Also, how do I pass a list of selected codes so that only those folders are read? For example:
files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682
From the above, I need to read the files under the following codes only:
345678,345679,345682
How can I do this in Python?
Recommended answer
The boto3 API does not support reading multiple objects at once. What you can do is retrieve all objects with a specified prefix and load each returned object in a loop. To do this, use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a simple change to your code that gets all objects with the prefix "files/splittedfiles/Code-345678"; you can then loop over those objects and load each file into a DataFrame:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
# Only list objects whose key starts with the given prefix
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
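One thing to watch with prefix matching: Prefix is a plain string match on the key, so "files/splittedfiles/Code-345678" would also match a folder such as Code-3456789, and the listing can include non-CSV objects or an empty folder placeholder. The sketch below shows one way to narrow a list of keys before reading them. The helper names and the file names are hypothetical, and it assumes the keys look like files/splittedfiles/Code-345678/<file>.csv.

```python
def folder_prefix(folder):
    """Return the folder path with exactly one trailing slash."""
    return folder.rstrip("/") + "/"


def csv_keys(keys, folder):
    """Keep only keys that are .csv files inside the given folder."""
    prefix = folder_prefix(folder)
    return [k for k in keys if k.startswith(prefix) and k.lower().endswith(".csv")]


# Hypothetical keys, as they might come back from obj.key in the loop above
keys = [
    "files/splittedfiles/Code-345678/",             # folder placeholder object
    "files/splittedfiles/Code-345678/part-1.csv",
    "files/splittedfiles/Code-345678/notes.txt",
    "files/splittedfiles/Code-3456789/part-1.csv",  # different folder, same leading text
]
print(csv_keys(keys, "files/splittedfiles/Code-345678"))
```

In the loop, the same check could be applied to obj.key before calling obj.get(), so that only the CSV files in the intended folder are downloaded.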
If you have multiple prefixes to evaluate, you can turn the above into a function that takes the prefix as a parameter, and then combine the results. The function could look something like this:
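import io

import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_df = []
    for obj in prefix_objs:
        key = obj.key
        body = obj.get()['Body'].read()
        # body is raw bytes, so parse it as CSV instead of passing it to pd.DataFrame
        df = pd.read_csv(io.BytesIO(body))
        prefix_df.append(df)
    # Stack the per-file frames into a single DataFrame
    return pd.concat(prefix_df, ignore_index=True)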
You can then apply this function to each prefix in turn and combine the results at the end.
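As a rough sketch of that last step for the selected codes in the question (345678, 345679, 345682): the helper names below are hypothetical, the base path is taken from the question, and the trailing slash on each prefix is an assumption about how the keys are laid out. The reader is passed in as an argument so that any function mapping a prefix to a DataFrame can be used.

```python
BASE_PREFIX = "files/splittedfiles/Code-"  # taken from the paths in the question


def prefixes_for_codes(codes, base=BASE_PREFIX):
    """Build one folder prefix per selected code (trailing slash is an assumption)."""
    return [f"{base}{code}/" for code in codes]


def read_codes_to_df(codes, read_prefix):
    """Call read_prefix once per selected code and stack the results.

    read_prefix is any callable that maps a prefix to a DataFrame,
    for example the read_prefix_to_df function from the answer.
    """
    import pandas as pd  # imported lazily so the prefix helper works without pandas

    frames = [read_prefix(prefix) for prefix in prefixes_for_codes(codes)]
    return pd.concat(frames, ignore_index=True)


selected_codes = [345678, 345679, 345682]
print(prefixes_for_codes(selected_codes))
```

With the function from the answer in scope, the combined DataFrame would then come from read_codes_to_df(selected_codes, read_prefix_to_df). Note that pd.concat raises an error if the list of frames is empty, so an empty code list needs separate handling.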