Iterate over files in an S3 bucket with folder structure

Question

I have an S3 bucket. Inside the bucket, we have a folder for the year, 2018, and the files we have collected for each month and day. So, as an example, 2018\3\24, 2018\3\25, and so on.

We didn't put the dates in the files inside each day's folder.

Basically, I want to iterate through the bucket and use the folder structure to classify each file by its 'date', since we need to load the files into a different database and will need a way to identify them.

I've read a ton of posts on using boto3 and iterating through buckets; however, there seem to be conflicting details on whether what I need can be done.

If there's an easier way of doing this, please suggest it.

I got close:

import boto3

s3client = boto3.client('s3')
bucket = 'bucketname'
startAfter = '2018'

s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in s3objects['Contents']:
    print(obj['Key'])

Recommended answer

When using boto3, a single list request returns at most 1000 objects. So to obtain all the objects in the bucket, you can use the S3 paginator.

client.get_paginator('list_objects_v2') is what you need.

Something like this:

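import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
# Each page holds at most 1000 keys; the paginator requests further pages as needed.
result = paginator.paginate(Bucket='bucketname', StartAfter='2018')
for page in result:
    # A page with no matching objects has no "Contents" entry.
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            print(keyString)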
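
The question also asks how to classify each file by the date encoded in its folder path. A minimal sketch of that step follows, assuming the keys look like `2018/3/24/filename` with forward slashes (the usual S3 separator, although the question writes backslashes). The names `date_from_key` and `classify_keys` are hypothetical helpers, not part of boto3.

```python
from datetime import date
from typing import Iterable, Iterator, Optional, Tuple


def date_from_key(key: str) -> Optional[date]:
    """Return the date held in a key's first three path segments, or None.

    Assumes keys shaped like "year/month/day/filename".
    """
    parts = key.split("/")
    # Require year/month/day plus a non-empty file name; this also skips
    # zero-byte "folder" placeholder keys such as "2018/3/24/".
    if len(parts) < 4 or not parts[-1]:
        return None
    try:
        year, month, day = (int(p) for p in parts[:3])
        return date(year, month, day)
    except ValueError:
        # A non-numeric segment or an impossible date such as 2018/2/30.
        return None


def classify_keys(keys: Iterable[str]) -> Iterator[Tuple[str, date]]:
    """Yield (key, date) pairs, skipping keys that do not match the layout."""
    for key in keys:
        key_date = date_from_key(key)
        if key_date is not None:
            yield key, key_date
```

Each key string printed in the loop above could be passed through `date_from_key`, and the resulting date stored alongside the file when it is loaded into the other database.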

From the docs:

list_objects:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.

list_objects_v2:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket. Note: ListObjectsV2 is the revised List Objects API and we recommend you use this revised API for new application development.
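
As one illustration of using request parameters as selection criteria, the `Prefix` and `Delimiter` parameters of list_objects_v2 can list the "subfolders" under a path instead of every object. This is a rough sketch, assuming keys use `/` as the separator. `list_subfolders` is a hypothetical helper, and the client is passed in rather than created here.

```python
from typing import Any, List


def list_subfolders(client: Any, bucket: str, prefix: str) -> List[str]:
    """Return the immediate "subfolder" prefixes found under `prefix`.

    `client` is expected to be a boto3 S3 client.
    """
    paginator = client.get_paginator("list_objects_v2")
    folders: List[str] = []
    # With Delimiter="/", keys that share the next path segment are rolled up
    # into CommonPrefixes entries instead of being listed one by one.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            folders.append(entry["Prefix"])
    return folders
```

With a real client, a call such as `list_subfolders(boto3.client('s3'), 'bucketname', '2018/')` would be expected to return prefixes like `2018/3/`. Using `Prefix='2018/'` also restricts the listing to that year, whereas `StartAfter='2018'` returns every key that sorts after `2018`.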

From another answer:

list_objects_v2 has added features. Because of the limit of 1000 keys per page, using a marker to list multiple pages can be a headache. Logically, you need to keep track of the last key you successfully processed. With ContinuationToken, you don't need to know the last key; you just check for the existence of NextContinuationToken in the response. You can spawn parallel processes to deal with multiples of 1000 keys without dealing with the last key to fetch the next page.
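
The ContinuationToken flow described above can be sketched by hand as follows. This is only an illustration of what the paginator does for you, under the assumption that `client` is a boto3 S3 client. `iter_keys` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterator


def iter_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every key in `bucket`, following NextContinuationToken by hand."""
    kwargs: Dict[str, Any] = {"Bucket": bucket}
    if prefix:
        kwargs["Prefix"] = prefix
    while True:
        response = client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no matching objects.
        for obj in response.get("Contents", []):
            yield obj["Key"]
        token = response.get("NextContinuationToken")
        if token is None:
            # No token means this was the last page.
            break
        # Pass the token back to request the next page of up to 1000 keys.
        kwargs["ContinuationToken"] = token
```

With a real client this might be called as `iter_keys(boto3.client('s3'), 'bucketname', '2018/')`. In practice the paginator shown earlier is usually the simpler choice, since it handles the token for you.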
