Retrieving subfolder names in an S3 bucket from boto3
Question
Using boto3, I can access my AWS S3 bucket:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')
Now, the bucket contains a folder first-level, which itself contains several sub-folders named with a timestamp, for instance 1456753904534. I need to know the names of these sub-folders for another job I'm doing, and I wonder whether I could have boto3 retrieve them for me.
So I tried:
objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')
which gives a dictionary whose key 'Contents' holds all the third-level files instead of the second-level timestamp directories. In fact, I get a list containing entries such as:
{u'ETag': '"etag"', u'Key': 'first-level/1456753904534/part-00014', u'LastModified': datetime.datetime(2016, 2, 29, 13, 52, 24, tzinfo=tzutc()),
u'Owner': {u'DisplayName': 'owner', u'ID': 'id'},
u'Size': size, u'StorageClass': 'storageclass'}
You can see that specific files, in this case part-00014, are retrieved, while I'd like to get the name of the directory alone. In principle I could strip the directory name out of all the paths, but it's ugly and expensive to retrieve everything at the third level just to get the second level!
I also tried an approach reported elsewhere:
for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)
but I do not get the folders at the desired level.
Is there a way to solve this?
Recommended answer
S3 is an object store; it doesn't have a real directory structure. The "/" is rather cosmetic. One reason people want a directory structure is that they can maintain, prune, or add to a tree in their application. In S3, you treat such a structure as a sort of index or search tag.
To manipulate objects in S3, you need boto3.client or boto3.resource, e.g. to list all objects:
import boto3
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name')
http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects
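Note that a single listing call returns at most 1,000 keys, so a larger bucket has to be read page by page. A minimal sketch of that, assuming a helper named iter_keys (a hypothetical name, not part of boto3) that receives an already-created client and relies on boto3's list_objects_v2 paginator:

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every object key in `bucket` that starts with `prefix`.

    `iter_keys` is a hypothetical helper, not part of boto3; `client` is
    expected to be the object returned by boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matched.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

With a real client this could be used as `for key in iter_keys(boto3.client("s3"), "bucket-name"): print(key)`.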
In fact, if the S3 object names are stored using the '/' separator, the more recent version of list_objects (list_objects_v2) lets you limit the response to keys that begin with a specified prefix.
To limit the items to those under certain sub-folders:
import boto3
s3 = boto3.client("s3")
BUCKET = 'bucket-name'  # placeholder bucket name
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)
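To get the sub-folder names themselves rather than the objects beneath them, the same call also accepts a Delimiter parameter. Keys that share a prefix up to the next delimiter are then rolled up into the 'CommonPrefixes' part of the response instead of being listed one by one. A sketch under that assumption, where list_subfolders is a hypothetical helper name and the bucket and folder names are the ones from the question:

```python
def list_subfolders(client, bucket, prefix):
    """Return the names of the immediate "sub-folders" under `prefix`.

    `list_subfolders` is a hypothetical helper, not part of boto3;
    `client` is expected to be the object returned by boto3.client("s3").
    """
    # The prefix must end with the delimiter, otherwise the prefix
    # itself is reported back as the only common prefix.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    paginator = client.get_paginator("list_objects_v2")
    names = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            # entry["Prefix"] looks like "first-level/1456753904534/"
            names.append(entry["Prefix"][len(prefix):].rstrip("/"))
    return names
```

Called as `list_subfolders(boto3.client("s3"), "my-bucket-name", "first-level")`, this should return the timestamp names without transferring the third-level keys.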
Another option is to use Python's os.path functions to extract the folder prefix. The problem is that this requires listing objects from undesired directories.
import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)    # 'part-00014'
foldername = os.path.dirname(s3_key)   # 'first-level/1456753904534'
# if the keys use an unconventional delimiter such as '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]       # 'part-00014'
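Applied to a whole listing, the same string handling can collect the distinct second-level names. A minimal sketch, assuming the keys have already been fetched into a list; second_level_folders is a hypothetical helper name and the sample keys are made up for illustration:

```python
def second_level_folders(keys, first_level="first-level", delimiter="/"):
    """Return the sorted, de-duplicated second-level names found in `keys`."""
    folders = set()
    for key in keys:
        parts = key.split(delimiter)
        # Keep only keys shaped like first-level/<folder>/<file...>
        if len(parts) >= 3 and parts[0] == first_level:
            folders.add(parts[1])
    return sorted(folders)


# Illustrative keys only; real ones would come from a bucket listing.
sample_keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753905000/part-00001",
    "other/1456753906000/part-00001",
]
print(second_level_folders(sample_keys))
```

This still pays for listing every third-level key, which is exactly the cost the question wants to avoid, so it only makes sense when the full listing is needed anyway.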
A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client versus boto3.resource. If you develop an internal shared library, using boto3.resource gives you a black-box layer over the resources used.