Extract .gz files in S3 automatically
Question
I'm trying to find a solution to extract ALB log files in .gz format when they're automatically uploaded from the ALB to S3.
My bucket structure is like this:
/log-bucket
..alb-1/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-2/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
..alb-3/AWSLogs/account-number/elasticloadbalancing/ap-northeast-1/2018/log.gz
Basically, every 5 minutes, each ALB automatically pushes its logs to the corresponding S3 bucket. I'd like to extract new .gz files right at that time, in the same bucket.
Is there any way to handle this?
I noticed that we can use a Lambda function, but I'm not sure where to start. Sample code would be greatly appreciated!
Your best choice would probably be to have an AWS Lambda function subscribed to S3 events. Whenever a new object gets created, this Lambda function would be triggered. The Lambda function could then read the file from S3, extract it, write the extracted data back to S3 and delete the original one.
How that works is described in Using AWS Lambda with Amazon S3.
That said, you might also want to reconsider whether you really need to store uncompressed logs in S3. Compressed files are not only cheaper, as they don't take up as much storage space as uncompressed ones, but they are usually also faster to process, since in most cases the bottleneck is the network bandwidth for transferring the data, not the CPU resources available for decompression. Most tools also support working directly with compressed files. Take Amazon Athena (Compression Formats) or Amazon EMR (How to Process Compressed Files) for example.