Grouping S3 Files Into Subfolders


Problem Description


I have a pipeline that moves approximately 1 TB of data, all CSV files. In this pipeline there are hundreds of files with different names. They have a date component, which is automatically partitioned. My question is how to use the CDK to automatically create subfolders based on the name of the file. In other words, the data comes in as broad categories, but our data scientists need it at one more level of detail.

Recommended Answer


It appears that your requirement is to move incoming objects into folders based on information in their filename (Key).


This could be done by adding a trigger on the Amazon S3 bucket that triggers an AWS Lambda function when a new object is created.
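Since the question asks about the CDK, here is a hedged sketch of wiring up such a trigger with aws-cdk-lib v2 in Python. The construct IDs, the `lambda` asset folder, and the handler name are illustrative assumptions, not part of the original answer:

```python
from aws_cdk import Stack, aws_lambda as _lambda, aws_s3 as s3, aws_s3_notifications as s3n
from constructs import Construct

class FileMoverStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Bucket the CSV files land in (illustrative construct ID)
        bucket = s3.Bucket(self, "DataBucket")

        # Lambda function containing the move logic shown below;
        # "lambda" is an assumed folder holding the handler code
        mover_fn = _lambda.Function(
            self, "MoverFn",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="index.lambda_handler",
            code=_lambda.Code.from_asset("lambda"),
        )

        # Invoke the function whenever a new object is created in the bucket
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.LambdaDestination(mover_fn),
        )
```

This is infrastructure configuration only; the actual move logic lives in the Lambda handler below.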

Here is some code that moves files within Amazon S3 based on the filename:

import boto3
import urllib.parse

def lambda_handler(event, context):
    
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    
    # Only copy objects that were uploaded to the bucket root (to avoid an infinite loop)
    if '/' not in key:
        
        # Determine destination directory based on Key
        directory = key # Your logic goes here to extract the directory name
      
        # Copy object
        s3_client = boto3.client('s3')
        s3_client.copy_object(
            Bucket = bucket,
            Key = f"{directory}/{key}",
            CopySource= {'Bucket': bucket, 'Key': key}
        )
        
        # Delete source object
        s3_client.delete_object(
            Bucket = bucket,
            Key = key
        )
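For local testing, a minimal event payload with the shape the handler reads can exercise the key-parsing step without touching S3. The bucket name and key below are illustrative:

```python
import urllib.parse

# Minimal shape of the S3 put-event record the handler reads
event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "my-example-bucket"},      # illustrative name
            "object": {"key": "sales%5F2023-01-01.csv"},  # URL-encoded key (%5F is '_')
        }
    }]
}

bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])
print(bucket, key)  # my-example-bucket sales_2023-01-01.csv
```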


You would need to modify the code that determines the name of the destination directory based on the key of the new object.
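As a minimal sketch of that logic, assuming a hypothetical naming scheme like category_date.csv (e.g. sales_2023-01-01.csv), the destination directory could be taken from the part of the filename before the first underscore:

```python
def directory_for_key(key: str) -> str:
    """Derive a destination folder from a key like 'sales_2023-01-01.csv'.

    Assumes the category is the portion before the first underscore;
    adjust the split logic to match your actual naming convention.
    """
    filename = key.rsplit('/', 1)[-1]     # strip any path prefix
    category = filename.split('_', 1)[0]  # text before the first underscore
    return category

# 'sales_2023-01-01.csv' would then be copied to 'sales/sales_2023-01-01.csv'
```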


It also assumes that new objects arrive in the top level (root) of the bucket and are then moved into sub-directories. If, instead, new objects arrive under a given path (eg incoming/), then set the S3 trigger to operate only on that path and remove the if '/' not in key logic.
