Reading multiple CSV files from an S3 bucket with boto3


Question

我需要使用 python 中的 boto3 从 S3 存储桶中读取多个 csv 文件,最后将这些文件合并到 Pandas 中的单个数据帧中.

I need to read multiple csv files from S3 bucket with boto3 in python and finally combine those files in single dataframe in pandas.

I am able to read a single file with the following script in Python:

 import boto3

 s3 = boto3.resource('s3')
 bucket = s3.Bucket('test-bucket')
 for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()

My path is as follows:

 files/splittedfiles/Code-345678

Inside Code-345678 I have multiple CSV files which I have to read and combine into a single DataFrame in pandas.

Also, how do I pass a list of selected Codes so that only those folders are read? e.g.

files/splittedfiles/Code-345678
files/splittedfiles/Code-345679
files/splittedfiles/Code-345680
files/splittedfiles/Code-345681
files/splittedfiles/Code-345682

From the above I need to read files under the following Codes only:

345678,345679,345682

How can I do it in Python?

Answer

The boto3 API does not support reading multiple objects at once. What you can do is retrieve all objects that share a specified prefix and load each returned object in a loop. To do this, use the filter() method and set its Prefix parameter to the prefix of the objects you want to load. Below is a small change to your code that fetches every object with the prefix "files/splittedfiles/Code-345678"; you can then loop through those objects and load each file into a DataFrame:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
prefix_objs = bucket.objects.filter(Prefix="files/splittedfiles/Code-345678")
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()

If you have multiple prefixes to evaluate, you can turn the above into a function that takes the prefix as a parameter and then combine the results. The function could look something like this:

import io

import boto3
import pandas as pd

def read_prefix_to_df(prefix):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    prefix_objs = bucket.objects.filter(Prefix=prefix)
    prefix_dfs = []
    for obj in prefix_objs:
        body = obj.get()['Body'].read()
        # Parse the CSV bytes into a DataFrame; note that
        # pd.DataFrame(body) would not parse the CSV contents.
        df = pd.read_csv(io.BytesIO(body))
        prefix_dfs.append(df)
    return pd.concat(prefix_dfs, ignore_index=True)

Then you can apply this function to each prefix and combine the results at the end.
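For the second part of the question, one way to do this is to build the prefix for each selected Code and call the function above on each one. A minimal sketch (the `read_prefix_to_df` call is shown commented out because it needs live S3 access to the bucket from the question):

```python
import pandas as pd

# The Codes the question wants to read.
codes = ["345678", "345679", "345682"]

# Build the S3 prefix for each selected Code.
prefixes = ["files/splittedfiles/Code-" + code for code in codes]

# Apply read_prefix_to_df (defined above) to each prefix and
# combine everything into a single DataFrame:
# combined = pd.concat([read_prefix_to_df(p) for p in prefixes],
#                      ignore_index=True)
```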
