How to trigger a Python script whenever new data is loaded into an S3 bucket?


Question

I am trying to pull data from an S3 bucket that receives new records every second. Data comes in at 250+ G per hour. I am writing a Python script that will run continuously and collect new data as it arrives, second by second.

Here is the structure of the s3 bucket keys:

o_key=7111/year=2020/month=8/day=11/hour=16/minute=46/second=9/ee9.jsonl.gz
o_key=7111/year=2020/month=8/day=11/hour=16/minute=40/second=1/ee99999.jsonl.gz
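A key in this layout can be split into its partition fields, which also makes it possible to list only one hour of data instead of the whole bucket. The following is a minimal sketch, assuming every key follows exactly the structure shown above; `parse_key` and `prefix_for_hour` are hypothetical helper names, and the timestamp is kept timezone-naive because the question does not state a timezone.

```python
from datetime import datetime


def parse_key(key):
    """Split a key such as 'o_key=7111/year=2020/.../ee9.jsonl.gz' into its parts."""
    # Everything before the last '/' is a 'name=value' partition; the rest is the file name.
    *partitions, filename = key.split("/")
    fields = dict(part.split("=", 1) for part in partitions)
    timestamp = datetime(
        int(fields["year"]), int(fields["month"]), int(fields["day"]),
        int(fields["hour"]), int(fields["minute"]), int(fields["second"]),
    )
    return fields["o_key"], timestamp, filename


def prefix_for_hour(o_key, ts):
    """Build the key prefix that covers one hour (values are not zero-padded)."""
    return (
        f"o_key={o_key}/year={ts.year}/month={ts.month}"
        f"/day={ts.day}/hour={ts.hour}/"
    )
```

The resulting prefix could then be passed to Boto3 as `s3_bucket.objects.filter(Prefix=...)`, so that each listing call only scans the current hour.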

I am using Boto3 for this, and here is what I have so far:

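import boto3

# ACCESS_KEY, SECRET_KEY and BUCKET_NAME are defined elsewhere.
# Note: verify=False disables TLS certificate verification.
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, verify=False)
s3_bucket = s3_resource.Bucket(BUCKET_NAME)

# List every object in the bucket, newest first.
files = s3_bucket.objects.filter()
files = [obj.key for obj in sorted(files, key=lambda x: x.last_modified, reverse=True)]
for x in files:
    print(x)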

This prints every key in the bucket, sorted by last_modified. However, is there a way to pause the script until new data is loaded, then process that data, and so on, second by second? There can be 20-second delays when new data is loaded, which is another thing giving me trouble when working out the logic. Any ideas or suggestions would help.

import time

import boto3

s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, verify=False)
s3_bucket = s3_resource.Bucket(BUCKET_NAME)

files = s3_bucket.objects.filter()
while list(files):  # check if the key exists
    # objs and key are not defined anywhere in this snippet as posted
    if len(objs) > 0 and objs[0].key == key:
        print("Exists!")
    else:
        time.sleep(.1)  # sleep until the next key is there
        continue

This is another approach I tried, but it isn't working too well. I am trying to sleep whenever there is no new data and then process the new data once it is loaded.

Recommended answer

The Amazon S3 notification feature enables you to receive notifications when certain events happen in your bucket. To enable notifications, you must first add a notification configuration that identifies the events you want Amazon S3 to publish and the destinations where you want Amazon S3 to send the notifications. You store this configuration in the notification subresource that is associated with a bucket. - Typically in Lambda...

https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
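As a rough illustration of the notification approach, the sketch below builds a notification configuration for object-created events and shows a Lambda handler that reads the new keys from the event. It is not a definitive setup: the function names, the `process_object` placeholder, and any ARN or bucket name passed in are assumptions for illustration. Object keys in S3 event records are URL-encoded, so the handler decodes them before use.

```python
from urllib.parse import unquote_plus


def build_notification_config(lambda_arn, prefix=""):
    """Build a configuration that sends object-created events to a Lambda function."""
    config = {
        "LambdaFunctionArn": lambda_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if prefix:
        # Optionally restrict notifications to keys that start with this prefix.
        config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
        }
    return {"LambdaFunctionConfigurations": [config]}


def attach_notification(bucket_name, lambda_arn, prefix=""):
    """Store the configuration on the bucket (replaces any existing configuration)."""
    import boto3  # imported here so the rest of the sketch runs without AWS access

    s3_client = boto3.client("s3")
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=build_notification_config(lambda_arn, prefix),
    )


def process_object(bucket, key):
    """Hypothetical placeholder for the real per-file processing."""
    print(f"new object: s3://{bucket}/{key}")


def lambda_handler(event, context):
    """Entry point invoked by S3; returns the decoded keys it handled."""
    keys = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (for example '=' as '%3D'), so decode them first.
        key = unquote_plus(record["s3"]["object"]["key"])
        process_object(bucket, key)
        keys.append(key)
    return keys
```

With this design the script no longer polls or sleeps: S3 invokes the handler once new objects are created. The Lambda function also has to grant S3 permission to invoke it before the configuration can be stored on the bucket.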

Hope this helps
r0ck
