Reading multiple csv files from S3 bucket with boto3


Problem description

I need to read multiple csv files from an S3 bucket with boto3 in Python and finally combine those files into a single dataframe in pandas.

I am able to read a single file with the following script in Python:

    import boto3

    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    for obj in bucket.objects.all():
        key = obj.key
        body = obj.get()['Body'].read()

Below is my path:

 files/splittedfiles/Code-345678

Under Code-345678 I have multiple csv files which I have to read and combine into a single dataframe in pandas.

Also, how do I pass a list of selected Codes, so that it reads only those folders? e.g.

files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682

From the above I need to read files under the following codes only:

345678,345679,345682

How can I do this in Python?

Answer

The boto3 API does not support reading multiple objects at once. What you can do is retrieve all objects with a specified prefix and load each of the returned objects in a loop. To do this, use the filter() method and set the Prefix parameter to the prefix of the objects you want to load. Below is a simple change to your code that retrieves all objects with the prefix "files/splittedfiles/Code-345678"; you can then loop through them and load each file into a DataFrame:

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
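Note that `body` here is raw bytes; to turn a CSV payload into a DataFrame you can wrap it in a file-like buffer and hand it to pandas. A minimal sketch, with sample bytes standing in for the S3 response:

```python
import io

import pandas as pd

# Sample bytes standing in for obj.get()['Body'].read()
body = b"id,value\n1,10\n2,20\n"

# Wrap the bytes in a buffer so pandas can parse them as CSV
df = pd.read_csv(io.BytesIO(body))
print(df.shape)  # (2, 2)
```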

If you have multiple prefixes to evaluate, you can turn the above into a function that takes the prefix as a parameter and then combine the results. The function could look something like this:

import io

import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_dfs = []
    for obj in prefix_objs:
        body = obj.get()['Body'].read()
        # Parse the raw CSV bytes into a DataFrame
        df = pd.read_csv(io.BytesIO(body))
        prefix_dfs.append(df)
    return pd.concat(prefix_dfs, ignore_index=True)

Then you can apply this function to each prefix in turn and combine the results at the end.
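For example, the selected codes from the question can be mapped to prefixes and the per-prefix results concatenated. A sketch (the DataFrames are built locally here to stand in for the S3 reads, which would call the function above):

```python
import pandas as pd

# Selected codes from the question
codes = ["345678", "345679", "345682"]

# Build one S3 prefix per code
prefixes = ["files/splittedfiles/Code-" + code for code in codes]

# Placeholder frames standing in for read_prefix_to_df(prefix) calls
frames = [pd.DataFrame({"code": [code]}) for code in codes]

combined = pd.concat(frames, ignore_index=True)
print(prefixes[0])    # files/splittedfiles/Code-345678
print(len(combined))  # 3
```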

