Boto3 S3: Get files without getting folders
Question
Using boto3, how can I retrieve all files in my S3 bucket without retrieving the folders?
Consider the following file structure:
file_1.txt
folder_1/
    file_2.txt
    file_3.txt
    folder_2/
        folder_3/
            file_4.txt
In this example I'm only interested in the 4 files.
A manual solution would be:
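def count_files_in_folder(prefix):
    total = 0
    # list_objects returns at most 1000 keys per call
    response = s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)
    # 'Contents' is absent when nothing matches the prefix
    for obj in response.get('Contents', []):
        # keys ending in '/' are the "folder" placeholder objects
        if not obj['Key'].endswith('/'):
            total += 1
    return total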
In this case total would be 4.
If I just did
count = len(s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)['Contents'])
the result would be 7 objects (4 files and 3 folders):
file_1.txt
folder_1/
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/
folder_1/folder_2/folder_3/
folder_1/folder_2/folder_3/file_4.txt
I only want:
file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
Recommended answer
S3 is an OBJECT STORE. It DOES NOT store files/objects in a directory tree. Newcomers are often confused by the "folder" option S3 offers, which is in fact just an arbitrary prefix on the object key.
An object PREFIX is a way to retrieve objects that are organised by a predefined, fixed key-name prefix structure.
You can imagine a file system that doesn't allow you to create directories, but does allow file names containing a slash "/" or backslash "\" as a delimiter, so that you can denote the "level" of a file by a common prefix.
Thus in S3, you can use any of the following to "simulate" a directory that is not really a directory.
folder1-folder2-folder3-myobject
folder1/folder2/folder3/myobject
folder1\folder2\folder3\myobject
As you can see, an object name can be stored in S3 regardless of which arbitrary folder separator (delimiter) you use.
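To make the point about prefixes and delimiters concrete, here is a minimal local sketch in plain Python. It does not call S3; split_level is a hypothetical helper name that only imitates, roughly, how a listing with a prefix and a delimiter separates "files" from "folders" by string matching on the key.

```python
def split_level(keys, prefix="", delimiter="/"):
    """Roughly emulate a prefix + delimiter listing over a list of key names.

    Returns (files, folders) for the single "level" directly under prefix.
    This is a local illustration only, not the S3 API itself.
    """
    files, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something follows a delimiter: group it under a common prefix.
            folders.add(prefix + head + sep)
        elif rest:
            # No delimiter left: this key sits directly at this level.
            files.append(key)
        # A key equal to the prefix itself is skipped in this sketch.
    return files, sorted(folders)


keys = [
    "file_1.txt",
    "folder_1/",
    "folder_1/file_2.txt",
    "folder_1/file_3.txt",
    "folder_1/folder_2/folder_3/file_4.txt",
]
top_files, top_folders = split_level(keys)
sub_files, sub_folders = split_level(keys, prefix="folder_1/")
```

The same function works with "-" or "\" as the delimiter, which is the whole point: the "folder" is only a naming convention on the key.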
However, to help users transfer files to S3 in bulk, tools such as the AWS CLI and the s3transfer API simplify the step by creating object names that follow your local folder structure.
So if you are sure that all of your S3 objects use / or \ as the separator, you can use tools like s3transfer or the AWS CLI to do a simple download by key name.
Here is some quick and dirty code using the resource iterator. Calling objects.filter() on a Bucket resource returns an iterator that pages through the results for you, so it is not subject to the 1000-key limit of a single list_objects()/list_objects_v2() call.
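import os
import boto3

s3 = boto3.resource('s3')
mybucket = s3.Bucket("mybucket")

# If a blank prefix is given, everything in the bucket is returned.
# S3 keys normally do not start with "/", so no leading slash here.
bucket_prefix = "some/prefix/here"
objs = mybucket.objects.filter(Prefix=bucket_prefix)

for obj in objs:
    # Skip "folder" placeholder keys such as "folder_1/".
    if obj.key.endswith("/"):
        continue
    path, filename = os.path.split(obj.key)
    # download_file raises an exception if the local folder does not exist.
    # A key with no "/" has an empty path, which makedirs would reject.
    if path:
        os.makedirs(path, exist_ok=True)
    mybucket.download_file(obj.key, obj.key)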
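If you only need the key names (or a count) rather than a download, the same filtering can be sketched with the low-level client and a paginator. This is a minimal sketch, not the only way to do it: iter_file_keys and list_file_keys are hypothetical helper names, and it assumes that "folders" in your bucket are keys ending in "/", as in the question.

```python
def iter_file_keys(pages):
    """Yield keys that do not end with '/' from list_objects_v2 response pages."""
    for page in pages:
        # 'Contents' is missing from a page when nothing matched.
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):
                yield obj["Key"]


def list_file_keys(s3_client, bucket_name, prefix=""):
    """Return all non-folder keys under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
    return list(iter_file_keys(pages))


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   keys = list_file_keys(boto3.client("s3"), "mybucket")
#   print(len(keys))
```

Passing the client in as an argument keeps the filtering logic separate from the AWS call, so it can be exercised with hand-built response pages.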