Upload a ZipOutputStream to S3 without saving the (large) zip file temporarily to disk, using the AWS S3 Java SDK


Question

I need to download photos (not all in the same directory) from S3, ZIP them, and upload the result back to S3 using the AWS S3 Java SDK. The zip file can be several GB in size. I am currently using AWS Lambda, which limits temporary storage to 500 MB, so I don't want to save the ZIP file on disk. Instead, I want to stream the ZIP file (created dynamically from the photos downloaded from S3) directly to S3. I need to do this with the AWS S3 Java SDK.

Recommended answer

The basic idea is to use streaming operations. This way you don't wait until the ZIP has been generated on a filesystem; the upload starts as soon as the ZIP algorithm produces any data. Some data will obviously be buffered in memory, but there is no need to wait for the whole ZIP to be written to disk. We'll also use stream composition and PipedInputStream / PipedOutputStream in two threads: one to read the data and ZIP the contents, and the other to upload the resulting ZIP stream.

Here is a version using the AWS SDK for Java 1.x (AmazonS3):

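// Imports for this snippet (AWS SDK for Java 1.x).
// Assumption: IOUtils is com.amazonaws.util.IOUtils; Apache Commons IO's IOUtils.copy behaves the same here.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.iterable.S3Objects;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.util.IOUtils;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.UUID;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final AmazonS3 client = AmazonS3ClientBuilder.defaultClient();

// Bytes written to pipedOutputStream become readable from pipedInputStream on another thread.
final PipedOutputStream pipedOutputStream = new PipedOutputStream();
final PipedInputStream pipedInputStream = new PipedInputStream(pipedOutputStream);

// Producer thread: downloads each object and writes it into the ZIP stream.
final Thread s3In = new Thread(() -> {
    try (final ZipOutputStream zipOutputStream = new ZipOutputStream(pipedOutputStream)) {
        S3Objects
                // It's just a convenient way to list all the objects. Replace with your own logic.
                .inBucket(client, "bucket")
                .forEach((S3ObjectSummary objectSummary) -> {
                    try {
                        if (objectSummary.getKey().endsWith(".png")) {
                            System.out.println("Processing " + objectSummary.getKey());

                            final ZipEntry entry = new ZipEntry(
                                    // Random name for brevity; derive it from objectSummary.getKey() in real code.
                                    UUID.randomUUID().toString() + ".png"
                            );

                            zipOutputStream.putNextEntry(entry);

                            // Closing the S3Object releases the underlying HTTP connection.
                            try (final S3Object s3Object = client.getObject(
                                    objectSummary.getBucketName(),
                                    objectSummary.getKey()
                            )) {
                                IOUtils.copy(s3Object.getObjectContent(), zipOutputStream);
                            }

                            zipOutputStream.closeEntry();
                        }
                    } catch (final Exception all) {
                        all.printStackTrace();
                    }
                });
    } catch (final Exception all) {
        all.printStackTrace();
    }
});

// Consumer thread: uploads whatever the producer writes into the pipe.
final Thread s3Out = new Thread(() -> {
    try {
        client.putObject(
                "another-bucket",
                "previews.zip",
                pipedInputStream,
                new ObjectMetadata() // no content length is set, which triggers the SDK warning
        );

        pipedInputStream.close();
    } catch (final Exception all) {
        all.printStackTrace();
    }
});

s3In.start();
s3Out.start();

// join() throws InterruptedException; the enclosing method must handle or declare it.
s3In.join();
s3Out.join();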

However, note that it will print a warning:

WARNING: No content length specified for stream data.  Stream contents will be buffered in memory and could result in out of memory errors.

That's because S3 needs to know the size of the data in advance, before the upload, and it's impossible to know the size of the resulting ZIP in advance. You can try multipart uploads, but the code will be trickier. The idea would be similar, though: one thread reads the data and writes it into the ZIP stream, and the other thread reads the zipped bytes and uploads them as the parts of a multipart upload. After all the parts are uploaded, the multipart upload has to be completed.
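
The multipart idea can be sketched with the JDK alone. The block below is a minimal sketch, not a tested S3 implementation: `ZipPartStreamer` and `PartSink` are hypothetical names introduced for illustration, and `PartSink.uploadPart` merely stands in for whatever upload-part call your SDK version provides. One thread writes the ZIP into a pipe while the calling thread cuts the bytes into fixed-size chunks, so only one chunk is held in memory at a time. Keep in mind that S3 requires every part except the last to be at least 5 MiB.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.Arrays;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipPartStreamer {

    /** Hypothetical stand-in for an S3 upload-part call; not part of any SDK. */
    interface PartSink {
        void uploadPart(int partNumber, byte[] data) throws IOException;
    }

    /**
     * Zips the given entries on a producer thread and passes the ZIP bytes to the
     * sink in chunks of partSize bytes (the last chunk may be smaller).
     * Returns the number of parts handed to the sink.
     */
    static int streamZipInParts(Map<String, byte[]> entries, int partSize, PartSink sink)
            throws IOException, InterruptedException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 64 * 1024);
        Throwable[] producerError = new Throwable[1];

        // Producer: writes the ZIP into the pipe; closing the stream signals end of data.
        Thread producer = new Thread(() -> {
            try (ZipOutputStream zip = new ZipOutputStream(pipeOut)) {
                for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                    zip.putNextEntry(new ZipEntry(e.getKey()));
                    zip.write(e.getValue());
                    zip.closeEntry();
                }
            } catch (Throwable t) {
                producerError[0] = t;
            }
        });
        producer.start();

        // Consumer: fills one buffer at a time and hands each full chunk to the sink.
        int partNumber = 0;
        try (InputStream in = pipeIn) {
            byte[] buffer = new byte[partSize];
            int filled;
            while ((filled = readFully(in, buffer)) > 0) {
                partNumber++;
                sink.uploadPart(partNumber, Arrays.copyOf(buffer, filled));
            }
        }
        producer.join();
        if (producerError[0] != null) {
            throw new IOException("Zipping failed", producerError[0]);
        }
        return partNumber;
    }

    /** Reads until the buffer is full or the stream ends; returns the number of bytes read. */
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int filled = 0;
        while (filled < buffer.length) {
            int n = in.read(buffer, filled, buffer.length - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        return filled;
    }
}
```

In real code the entries would be streamed from S3 rather than held as byte arrays, the sink would upload each chunk as one part of a multipart upload started beforehand, and the upload would be completed once `streamZipInParts` returns.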

Here is the piped-stream example written against the AWS SDK for Java 2.x (S3Client):

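// Imports for this snippet (AWS SDK for Java 2.x).
// Assumption: IOUtils is Apache Commons IO's org.apache.commons.io.IOUtils.
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.UUID;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.commons.io.IOUtils;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.core.sync.ResponseTransformer;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.S3Object;

final S3Client client = S3Client.create();

// Bytes written to pipedOutputStream become readable from pipedInputStream on another thread.
final PipedOutputStream pipedOutputStream = new PipedOutputStream();
final PipedInputStream pipedInputStream = new PipedInputStream(pipedOutputStream);

// Producer thread: downloads each object and writes it into the ZIP stream.
final Thread s3In = new Thread(() -> {
    try (final ZipOutputStream zipOutputStream = new ZipOutputStream(pipedOutputStream)) {
        client.listObjectsV2Paginator(
                ListObjectsV2Request
                        .builder()
                        .bucket("bucket")
                        .build()
        )
                .contents()
                .forEach((S3Object object) -> {
                    try {
                        if (object.key().endsWith(".png")) {
                            System.out.println("Processing " + object.key());

                            final ZipEntry entry = new ZipEntry(
                                    // Random name for brevity; derive it from object.key() in real code.
                                    UUID.randomUUID().toString() + ".png"
                            );

                            zipOutputStream.putNextEntry(entry);

                            // Copies the object body straight into the current ZIP entry.
                            client.getObject(
                                    GetObjectRequest
                                            .builder()
                                            .bucket("bucket")
                                            .key(object.key())
                                            .build(),
                                    ResponseTransformer.toOutputStream(zipOutputStream)
                            );

                            zipOutputStream.closeEntry();
                        }
                    } catch (final Exception all) {
                        all.printStackTrace();
                    }
                });
    } catch (final Exception all) {
        all.printStackTrace();
    }
});

// Consumer thread: reads the whole ZIP from the pipe into memory, then uploads it.
final Thread s3Out = new Thread(() -> {
    try {
        client.putObject(
                PutObjectRequest
                        .builder()
                        .bucket("another-bucket")
                        .key("previews.zip")
                        .build(),
                RequestBody.fromBytes(
                        IOUtils.toByteArray(pipedInputStream)
                )
        );
    } catch (final Exception all) {
        all.printStackTrace();
    }
});

s3In.start();
s3Out.start();

// join() throws InterruptedException; the enclosing method must handle or declare it.
s3In.join();
s3Out.join();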

It suffers from the same problem: the whole ZIP has to be held in memory before the upload.

If you're interested, I've prepared a demo project, so you can play with the code.
