AWS Lambda: How to extract a tgz file in a S3 bucket and put it in another S3 bucket


Question

I have an S3 bucket named "Source". Many '.tgz' files are being pushed into that bucket in real time. I wrote Java code to extract the '.tgz' files and push them into a "Destination" bucket, and deployed it as a Lambda function. I receive the '.tgz' file as an InputStream in my Java code. How do I extract it in Lambda? I'm not able to create a file in Lambda; it throws "FileNotFound (Permission Denied)" in Java.

AmazonS3 s3Client = new AmazonS3Client();
S3Object s3Object = s3Client.getObject(new GetObjectRequest(srcBucket, srcKey));
InputStream objectData = s3Object.getObjectContent();
// Lambda's filesystem is read-only except for /tmp, so creating a file
// in the working directory fails:
File file = new File(s3Object.getKey());
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file)); // <-- throws FileNotFound (Permission denied) here

Answer

Since one of the responses was in Python, I provide an alternative solution in this language.

The problem with the solution that uses the /tmp file system is that AWS allows storing only 512 MB there. To untar or unzip larger files, it's better to use the io package's BytesIO class and process the file contents purely in memory. AWS allows assigning up to 3 GB of RAM to a Lambda, which raises the maximum file size significantly. I successfully tested untarring a 1 GB S3 file.
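As a minimal, S3-free sketch of the in-memory approach, the following builds a tar archive inside a BytesIO buffer and extracts it without ever touching the filesystem (the file names and contents here are made up for illustration):

```python
import tarfile
from io import BytesIO

def tar_bytes(files):
    """Pack a dict of {name: bytes} into an in-memory tar archive."""
    buf = BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, BytesIO(data))
    return buf.getvalue()

def untar_bytes(archive_bytes):
    """Unpack an in-memory tar archive back into {name: bytes}."""
    extracted = {}
    with tarfile.open(fileobj=BytesIO(archive_bytes)) as tar:
        for member in tar:
            if member.isfile():
                extracted[member.name] = tar.extractfile(member).read()
    return extracted

archive = tar_bytes({'a.txt': b'hello', 'b.txt': b'world'})
print(untar_bytes(archive))  # {'a.txt': b'hello', 'b.txt': b'world'}
```

The Lambda below does exactly this on the extraction side, except that the archive bytes come from `get_object` and each extracted member goes to `upload_fileobj` instead of a dict.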

In my case, untarring ~2000 files from a 1 GB tar file into another S3 bucket took 140 seconds. It can be further optimized by utilizing multiple threads to upload the untarred files to the target S3 bucket.
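The multi-threaded optimization could be sketched along these lines with `concurrent.futures` from the standard library. The `upload_one` callable is an assumption introduced for illustration; in the Lambda it would wrap `s3_client.upload_fileobj`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_members(members, upload_one, max_workers=8):
    """Upload (name, bytes) pairs concurrently.

    members    -- iterable of (name, data) tuples, e.g. extracted tar entries
    upload_one -- callable(name, data); hypothetical stand-in for
                  s3_client.upload_fileobj(BytesIO(data), Bucket=..., Key=name)
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_one, name, data)
                   for name, data in members]
        for future in as_completed(futures):
            future.result()  # re-raise any upload error

# Example with a stand-in uploader that just records what it was given:
uploaded = {}
upload_members([('a.txt', b'1'), ('b.txt', b'2')],
               lambda name, data: uploaded.update({name: data}))
print(sorted(uploaded))  # ['a.txt', 'b.txt']
```

Since the extracted members are already in memory, the threads only overlap network I/O; boto3 clients are thread-safe, so a single `s3_client` can be shared.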

The example code below presents a single-threaded solution:

import boto3
import botocore
import tarfile

from io import BytesIO
s3_client = boto3.client('s3')

def untar_s3_file(event, context):

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    input_tar_file = s3_client.get_object(Bucket = bucket, Key = key)
    input_tar_content = input_tar_file['Body'].read()

    with tarfile.open(fileobj = BytesIO(input_tar_content)) as tar:
        for tar_resource in tar:
            if tar_resource.isfile():
                inner_file_bytes = tar.extractfile(tar_resource).read()
                # Uploads back to the source bucket; change Bucket to the target bucket as needed.
                s3_client.upload_fileobj(BytesIO(inner_file_bytes), Bucket=bucket, Key=tar_resource.name)

