Triggering AWS Lambda on arrival of new files in AWS S3

Question

I have a Lambda function written in Python, which has the code to run Redshift copy commands for 3 tables from 3 files located in AWS S3.

Example:

I have tables A, B and C.

The Python code contains:

'copy to redshift A from "s3://bucket/abc/A.csv"'
'copy to redshift B from "s3://bucket/abc/B.csv"'
'copy to redshift C from "s3://bucket/abc/C.csv"'

This code is triggered whenever a new file among the three arrives at the "s3://bucket/abc/" location in S3. So it loads all three tables even if only one CSV file has arrived.

Best case solution: Break the code into three different Lambda functions and map each one directly to its source file's update/upload.

But my requirement is to go ahead with a single Lambda function, which will selectively run only the relevant part of it (using if) for the CSV files that got updated.

Example:

if (new csv file for A has arrived):
    'copy to redshift A from "s3://bucket/abc/A.csv"'
if (new csv file for B has arrived):
    'copy to redshift B from "s3://bucket/abc/B.csv"'
if (new csv file for C has arrived):
    'copy to redshift C from "s3://bucket/abc/C.csv"'

Currently, to achieve this, I am storing those files' metadata (LastModified) in a Python dict, with the file names as the keys. Printing the dict gives something like this:

{'bucket/abc/A.csv': '2019-04-17 11:14:11+00:00', 'bucket/abc/B.csv': '2019-04-18 12:55:47+00:00', 'bucket/abc/C.csv': '2019-04-17 11:09:55+00:00'}

Then, whenever a new file appears among any of the three, the Lambda is triggered; I read the dict and compare each file's time with the respective value in the dict, and if the LastModified has increased, I run that table's copy command.
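A minimal sketch of that check (my own illustration, assuming boto3 and that the dict stores datetime objects rather than the strings printed above; the bucket constant and helper name are placeholders):

import boto3

s3 = boto3.client('s3')
BUCKET = 'bucket'  # placeholder, taken from the question's example paths

def needs_copy(key, last_seen):
    # Fetch the object's current LastModified timestamp from S3.
    modified = s3.head_object(Bucket=BUCKET, Key=key)['LastModified']
    # Copy only if the file is new or newer than the recorded timestamp.
    if key not in last_seen or modified > last_seen[key]:
        last_seen[key] = modified
        return True
    return False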

All this because I could not find a workaround with S3 events/CloudWatch for this kind of use case.

Please ask further questions if the problem isn't articulated well.

Answer

When an Amazon S3 Event triggers an AWS Lambda function, it provides the Bucket name and Object key as part of the event:

import urllib.parse

def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

While the object details are passed as a list, I suspect that each event is only ever supplied with one object (hence the use of [0]). However, I'm not 100% certain that this will always be the case. Best to assume it until proven otherwise.
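If you would rather not rely on that assumption, a defensive variant (my own sketch, not part of the original answer) simply loops over every record in the event:

import urllib.parse

def lambda_handler(event, context):
    # Handle every record, in case an event ever carries more than one object.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        process(bucket, key)  # hypothetical helper holding the copy logic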

Thus, if your code is expecting specific objects, your code would be:

if key == 'abc/A.csv':
    'copy to Table-A from "s3://bucket/abc/A.csv"'
if key == 'abc/B.csv':
    'copy to Table-B from "s3://bucket/abc/B.csv"'
if key == 'abc/C.csv':
    'copy to Table-C from "s3://bucket/abc/C.csv"'
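Putting it together, here is a minimal sketch of one way to run those copies, assuming the Amazon Redshift Data API (boto3's redshift-data client) rather than a direct database connection; the cluster, database, user, IAM role, and table names (table_a/table_b/table_c standing in for Table-A/B/C) are placeholders:

import boto3
import urllib.parse

# Placeholder identifiers; replace with your own values.
CLUSTER = 'my-cluster'
DATABASE = 'mydb'
DB_USER = 'awsuser'
IAM_ROLE = 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'

# Map each expected object key to its target table.
KEY_TO_TABLE = {
    'abc/A.csv': 'table_a',
    'abc/B.csv': 'table_b',
    'abc/C.csv': 'table_c',
}

redshift = boto3.client('redshift-data')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    table = KEY_TO_TABLE.get(key)
    if table is None:
        return  # not one of the three files we care about

    # Submit the COPY for just the table whose file arrived.
    redshift.execute_statement(
        ClusterIdentifier=CLUSTER,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=f"COPY {table} FROM 's3://{bucket}/{key}' IAM_ROLE '{IAM_ROLE}' CSV",
    )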

There is no need to store LastModified, since the event is triggered whenever a new file is uploaded. Also, be careful about storing data in a global dict and expecting it to be around at a future execution — this will not always be the case. A Lambda container can be removed if it does not run for a period of time, and additional Lambda containers might be created if there is concurrent execution.
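A toy illustration of that caveat (not from the answer): module-level state survives only while the same container is reused, and a cold start silently resets it.

# Module-level state is initialised once per container, not once per invocation.
invocation_count = 0

def lambda_handler(event, context):
    global invocation_count
    invocation_count += 1
    # On a warm container this climbs 1, 2, 3, ...; after a cold start it is
    # back to 1, so it cannot be trusted as persistent storage.
    print(f'invocation #{invocation_count}')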

If you always know that you are expecting 3 files and they are always uploaded in a certain order, then you could instead use the upload of the 3rd file to trigger the process, which would then copy all 3 files to Redshift.
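A sketch of that variant, assuming C.csv is always the last of the three to arrive (run_copy is a hypothetical helper wrapping the COPY command):

if key == 'abc/C.csv':  # the final file's arrival triggers the whole load
    for table, source_key in (('table_a', 'abc/A.csv'),
                              ('table_b', 'abc/B.csv'),
                              ('table_c', 'abc/C.csv')):
        run_copy(table, f's3://{bucket}/{source_key}')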
