Upload a ZipOutputStream to S3 without temporarily saving the (large) ZIP file to disk, using the AWS S3 Java SDK

Question

I need to download photos from S3 (they are not all in the same directory), ZIP them, and upload the result back to S3 using the AWS S3 Java SDK. The ZIP file can be several GB in size. I'm currently running on AWS Lambda, whose temporary storage is limited to 500 MB, so I don't want to save the ZIP file to disk. Instead, I want to stream the ZIP file, which is built on the fly from the downloaded photos, directly to S3. I need to do this with the AWS S3 Java SDK.

Recommended answer

The basic idea is to use streaming operations. That way you don't wait until the whole ZIP has been generated on a filesystem; you start uploading as soon as the ZIP algorithm produces any data. Some data will obviously be buffered in memory, but there is no need to wait for the whole ZIP to be written to disk. We'll also use stream composition and PipedInputStream / PipedOutputStream across two threads: one reads the photos and writes them into the ZIP stream, and the other uploads the ZIP bytes as they arrive.
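The pipe mechanics can be tried locally without any AWS dependency. The following is a minimal sketch, not the answer's code: the class and method names (PipedZipDemo, zipThroughPipe, unzip) are made up for illustration, the "photos" are in-memory byte arrays, and the consumer simply collects the bytes where a real program would upload them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PipedZipDemo {

    /** Zips the given files on a producer thread while the calling thread drains the pipe. */
    static byte[] zipThroughPipe(Map<String, byte[]> files) throws IOException, InterruptedException {
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 64 * 1024); // 64 KiB pipe buffer

        Thread producer = new Thread(() -> {
            // Closing the ZipOutputStream writes the central directory and closes the pipe,
            // which is what signals end-of-stream to the consumer.
            try (ZipOutputStream zip = new ZipOutputStream(pipeOut)) {
                for (Map.Entry<String, byte[]> file : files.entrySet()) {
                    zip.putNextEntry(new ZipEntry(file.getKey()));
                    zip.write(file.getValue());
                    zip.closeEntry();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        producer.start();

        // Consumer: stands in for the upload; here it only collects the bytes.
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try (InputStream in = pipeIn) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                collected.write(buffer, 0, n);
            }
        }
        producer.join();
        return collected.toByteArray();
    }

    /** Reads entry names and UTF-8 contents back from ZIP bytes, to check the result. */
    static Map<String, String> unzip(byte[] zipBytes) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                result.put(entry.getName(), new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "first".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "second".getBytes(StandardCharsets.UTF_8));
        System.out.println(unzip(zipThroughPipe(files)));
    }
}
```

The two ends of a pipe have to live on different threads: the pipe buffer is bounded, so a single thread that writes more than the buffer holds would block forever waiting for a reader.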

Here is the code, written against the AWS SDK for Java 1.x:

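import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.iterable.S3Objects;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
// Assumption: the answer does not say which IOUtils it uses; the SDK's own class is assumed here.
import com.amazonaws.util.IOUtils;

import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.UUID;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final AmazonS3 client = AmazonS3ClientBuilder.defaultClient();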

final PipedOutputStream pipedOutputStream = new PipedOutputStream();
final PipedInputStream pipedInputStream = new PipedInputStream(pipedOutputStream);

final Thread s3In = new Thread(() -> {
    try (final ZipOutputStream zipOutputStream = new ZipOutputStream(pipedOutputStream)) {
        S3Objects
                // This is just a convenient way to list all the objects. Replace it with your own logic.
                .inBucket(client, "bucket")
                .forEach((S3ObjectSummary objectSummary) -> {
                    try {
                        if (objectSummary.getKey().endsWith(".png")) {
                            System.out.println("Processing " + objectSummary.getKey());

                            final ZipEntry entry = new ZipEntry(
                                    // Random placeholder name; real code would derive the entry
                                    // name from objectSummary.getKey().
                                    UUID.randomUUID().toString() + ".png"
                            );

                            zipOutputStream.putNextEntry(entry);

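                            // Close the S3 object after copying so its HTTP connection is released.
                            try (final S3Object object = client.getObject(
                                    objectSummary.getBucketName(),
                                    objectSummary.getKey())) {
                                IOUtils.copy(object.getObjectContent(), zipOutputStream);
                            }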

                            zipOutputStream.closeEntry();
                        }
                    } catch (final Exception all) {
                        all.printStackTrace();
                    }
                });
    } catch (final Exception all) {
        all.printStackTrace();
    }
});
final Thread s3Out = new Thread(() -> {
    // try-with-resources closes the read end of the pipe even if the upload throws.
    try (final PipedInputStream zipStream = pipedInputStream) {
        client.putObject(
                "another-bucket",
                "previews.zip",
                zipStream,
                new ObjectMetadata()
        );
    } catch (final Exception all) {
        all.printStackTrace();
    }
});

s3In.start();
s3Out.start();

s3In.join();
s3Out.join();
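The code above gives every entry a random UUID name. Since the photos in the question come from different directories, deriving names from the object keys needs some care: two keys can end in the same file name, and ZipOutputStream.putNextEntry rejects a duplicate entry name with a ZipException. The following is a minimal sketch of one way to handle that; the ZipEntryNames class and its entryNameFor method are hypothetical helpers, not part of the answer or of the AWS SDK.

```java
import java.util.HashSet;
import java.util.Set;

class ZipEntryNames {

    // Names already handed out for the current ZIP file.
    private final Set<String> used = new HashSet<>();

    /**
     * Returns the part of the S3 key after the last '/', with "-2", "-3", ...
     * inserted before the extension when that name was already handed out.
     * Keys that end in '/' (folder markers) should be skipped by the caller.
     */
    String entryNameFor(String key) {
        String name = key.substring(key.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";

        String candidate = name;
        // Set.add returns false when the name is taken, so keep trying suffixes.
        for (int i = 2; !used.add(candidate); i++) {
            candidate = base + "-" + i + ext;
        }
        return candidate;
    }

    public static void main(String[] args) {
        ZipEntryNames names = new ZipEntryNames();
        System.out.println(names.entryNameFor("2019/trip/photo.png"));
        System.out.println(names.entryNameFor("2020/photo.png"));
    }
}
```

A simpler alternative is to use the full object key as the entry name, for example new ZipEntry(objectSummary.getKey()). Keys are unique within a bucket, and ZIP entry names may contain '/', so the archive then mirrors the bucket's directory layout.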

Note, however, that the putObject call above prints a warning:

WARNING: No content length specified for stream data.  Stream contents will be buffered in memory and could result in out of memory errors.

That's because S3 needs to know the size of the data before the upload starts, and the size of the resulting ZIP cannot be known in advance. Without a content length, the SDK buffers the stream in memory, as the warning says. You can try multipart uploads instead, but the code gets trickier. The idea is similar: one thread reads the data and writes it into the ZIP stream, while the other thread reads the zipped output in chunks and uploads each chunk as a part. After all the parts are uploaded, the multipart upload must be completed.
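The chunking half of that idea can be sketched without the SDK. S3 requires every part except the last to be at least 5 MiB and allows at most 10,000 parts per upload, so the reader has to fill a whole part buffer before handing it over. In the sketch below, PartSplitter and its PartSink callback are hypothetical names; in a real program the callback would wrap the SDK's uploadPart call, collect the returned part ETags, and pass them to completeMultipartUpload once the stream ends.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class PartSplitter {

    /** Hypothetical callback; a real implementation would upload the part to S3. */
    interface PartSink {
        // The buffer is reused for the next part, so it must be consumed before returning.
        void accept(int partNumber, byte[] buffer, int length) throws IOException;
    }

    /**
     * Reads the stream to its end and hands it over in parts of partSize bytes,
     * followed by one shorter final part if needed. Returns the number of parts.
     */
    static int split(InputStream in, int partSize, PartSink sink) throws IOException {
        byte[] buffer = new byte[partSize];
        int partNumber = 0; // S3 part numbers start at 1
        while (true) {
            int filled = 0;
            // A single read() may return fewer bytes than requested (pipes often do),
            // so keep reading until the buffer is full or the stream ends.
            while (filled < partSize) {
                int n = in.read(buffer, filled, partSize - filled);
                if (n == -1) {
                    break;
                }
                filled += n;
            }
            if (filled == 0) {
                break; // stream ended exactly on a part boundary
            }
            partNumber++;
            sink.accept(partNumber, buffer, filled);
            if (filled < partSize) {
                break; // a short part can only be the last one
            }
        }
        return partNumber;
    }

    public static void main(String[] args) throws IOException {
        // Tiny part size for demonstration only; S3 needs at least 5 MiB per non-final part.
        InputStream in = new ByteArrayInputStream(new byte[12]);
        int parts = split(in, 5, (number, buffer, length) ->
                System.out.println("part " + number + ": " + length + " bytes"));
        System.out.println(parts + " parts");
    }
}
```

With this approach memory use is bounded by one part buffer rather than by the size of the ZIP. At the minimum part size of 5 MiB, the 10,000-part limit caps the object at roughly 48.8 GiB, so larger archives need a larger part size.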

Here is

It suffers from the same plague: the ZIP needs to be prepared in memory before the upload.

If you're interested, I've prepared a demo project, so you can play with the code.
