Retrieving subfolder names in an S3 bucket with boto3

Question

Using boto3, I can access my AWS S3 bucket:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')

Now, the bucket contains a folder first-level, which itself contains several sub-folders named with a timestamp, for instance 1456753904534. I need to know the names of these sub-folders for another job I'm doing, and I wonder whether I could have boto3 retrieve them for me.

So I tried:

objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')

which gives a dictionary whose key 'Contents' gives me all the third-level files instead of the second-level timestamp directories. In fact, I get a list containing things such as

{u'ETag': '"etag"', u'Key': 'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
u'Size': size, u'StorageClass': 'storageclass'}

You can see that the specific files (in this case part-00014) are retrieved, while I'd like to get the name of the directory alone. In principle I could strip the directory name out of all the paths, but it's ugly and expensive to retrieve everything at the third level just to get the second level!

I also tried something reported here:

for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)

but I do not get the folders at the desired level.

Is there a way to solve this?

Recommended answer

S3 is an object store; it has no real directory structure. The "/" is rather cosmetic. People want a directory structure because it lets them maintain, prune, or add to a tree in their application. In S3, you treat such a structure as a kind of index or search tag.
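To make the "index or search tag" idea concrete, here is a small local simulation, not AWS code. It mimics, in simplified form, how a listing with a prefix and a delimiter splits flat keys into plain keys and grouped "common prefixes". The function name group_by_delimiter and the sample keys are made up for illustration.

```python
def group_by_delimiter(keys, prefix="", delimiter="/"):
    """Split flat keys into direct keys and grouped common prefixes.

    Simplified local model of a prefix/delimiter listing; it does not
    talk to S3 and ignores pagination.
    """
    contents = []
    common_prefixes = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The key continues past the delimiter: report only the group.
            group = prefix + head + delimiter
            if group not in common_prefixes:
                common_prefixes.append(group)
        else:
            # No delimiter after the prefix: this is a direct key.
            contents.append(key)
    return contents, common_prefixes


sample_keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753904999/part-00001",
    "readme.txt",
]
top_contents, top_groups = group_by_delimiter(sample_keys)
sub_contents, sub_groups = group_by_delimiter(sample_keys, prefix="first-level/")
```

In this model the "folders" exist only as shared leading parts of key strings, which is why a plain listing returns every file rather than the directories.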

To manipulate objects in S3, you need boto3.client or boto3.resource, e.g. to list all objects:

import boto3
s3 = boto3.client("s3")
# Note: a single list_objects call returns at most 1000 keys.
all_objects = s3.list_objects(Bucket='bucket-name')

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

In fact, if the S3 object name is stored using the '/' separator, the more recent version of list_objects (list_objects_v2) lets you limit the response to keys that begin with a specified prefix.

To limit the items to items under certain sub-folders:

import boto3
s3 = boto3.client("s3")
BUCKET = 'bucket-name'  # placeholder bucket name
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)
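A prefix-only listing still returns object keys rather than folder names. A minimal sketch of one way to get only the second-level names, assuming the keys use '/' as separator: passing Delimiter='/' together with a Prefix that ends in '/' makes S3 group the deeper keys into the CommonPrefixes field of the response. The helper name list_subfolders is hypothetical. It takes an already created boto3 S3 client as an argument, so the block itself does not import boto3.

```python
def list_subfolders(client, bucket, prefix):
    """Return the names of the immediate "sub-folders" under prefix.

    client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    names = []
    # The paginator follows continuation tokens, so more than one page
    # of results is handled.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # CommonPrefixes is absent when a page has no grouped prefixes.
        for entry in page.get("CommonPrefixes", []):
            # entry["Prefix"] looks like "first-level/1456753904534/".
            names.append(entry["Prefix"][len(prefix):].rstrip("/"))
    return names
```

With real credentials this might be called as list_subfolders(boto3.client("s3"), "my-bucket-name", "first-level"). Because the grouping happens on the S3 side, the third-level files are not transferred just to learn the second-level names.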

Another option is to use Python's os.path functions to extract the folder prefix. The problem is that this requires listing objects from undesired directories.

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key) 
foldername = os.path.dirname(s3_key)

# if you are using an unconventional delimiter such as '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client versus boto3.resource. If you develop an internal shared library, using boto3.resource gives you a black-box layer over the resources used.
