Merging files on AWS S3 (Using Apache Camel)

Question

I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete, these files need to be merged. Currently I am deleting these files and uploading merged files again. This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?

I am using Apache Camel for routing.

Recommended Answer

S3 allows you to use an S3 file URI as the source for a copy operation. Combined with S3's Multi-Part Upload API, you can supply several S3 object URIs as the source keys for a multi-part upload.
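
The answer does not include code, but a minimal sketch of this approach might look like the following, using the AWS SDK for Java v1 (the bucket and key names are placeholders). It assumes every source object except the last is at least 5MB, which is exactly the caveat discussed next.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CopyPartRequest;
import com.amazonaws.services.s3.model.CopyPartResult;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class S3Concat {

    /** Concatenates the source objects into destKey without downloading or re-uploading them. */
    public static void concatenate(AmazonS3 s3, String bucket,
                                   List<String> sourceKeys, String destKey) {
        // Start a multi-part upload for the merged object.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey)).getUploadId();

        List<PartETag> partETags = new ArrayList<>();
        int partNumber = 1;

        // Each source object becomes one part; S3 copies the bytes server-side.
        for (String sourceKey : sourceKeys) {
            CopyPartResult result = s3.copyPart(new CopyPartRequest()
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber++)
                    .withSourceBucketName(bucket)
                    .withSourceKey(sourceKey)
                    .withDestinationBucketName(bucket)
                    .withDestinationKey(destKey));
            partETags.add(result.getPartETag());
        }

        // Stitch the copied parts together into the final object.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, destKey, uploadId, partETags));
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        concatenate(s3, "my-bucket",                                   // placeholder bucket
                Arrays.asList("staging/part-0001.csv", "staging/part-0002.csv"),
                "merged/output.csv");                                  // placeholder keys
    }
}
```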

However, the devil is in the details. S3's multi-part upload API has a minimum part size of 5MB. Thus, if any file in the series of files under concatenation is < 5MB, the upload will fail.

However, you can work around this by exploiting the loophole which allows the final upload piece to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).

My production code does this by:

  1. Interrogating the manifest of files to be uploaded
  2. If the first part is under 5MB, download pieces* and buffer to disk until 5MB is buffered.
  3. Append parts sequentially until file concatenation is complete.
  4. If a non-final file is < 5MB, append it, then finish the upload, create a new upload, and continue (a simplified sketch of this loop follows the list).
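
The production code itself is not shown, so the following is only a simplified sketch of the loop described above (AWS SDK for Java v1; MIN_PART, the class name, and the bucket/key names are illustrative). For brevity it buffers undersized runs in memory and uploads each run as a regular part, rather than finishing and restarting the upload as in step 4.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.*;
import com.amazonaws.util.IOUtils;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class S3ConcatWithSmallParts {

    private static final long MIN_PART = 5L * 1024 * 1024; // S3 minimum for every part except the last

    public static void concatenate(AmazonS3 s3, String bucket,
                                   List<String> sourceKeys, String destKey) throws IOException {
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey)).getUploadId();
        List<PartETag> partETags = new ArrayList<>();
        int partNumber = 1;

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int i = 0; i < sourceKeys.size(); i++) {
            String key = sourceKeys.get(i);
            long size = s3.getObjectMetadata(bucket, key).getContentLength();
            boolean lastFile = (i == sourceKeys.size() - 1);

            if (buffer.size() == 0 && size >= MIN_PART) {
                // Large enough on its own: copy it server-side as a single part.
                CopyPartResult copy = s3.copyPart(new CopyPartRequest()
                        .withUploadId(uploadId).withPartNumber(partNumber++)
                        .withSourceBucketName(bucket).withSourceKey(key)
                        .withDestinationBucketName(bucket).withDestinationKey(destKey));
                partETags.add(copy.getPartETag());
            } else {
                // Too small (or continuing a buffered run): download and accumulate.
                try (S3Object obj = s3.getObject(bucket, key)) {
                    buffer.write(IOUtils.toByteArray(obj.getObjectContent()));
                }
                // Flush once the buffer reaches 5MB, or unconditionally for the last file
                // (the final part of a multi-part upload may be smaller than 5MB).
                if (buffer.size() >= MIN_PART || lastFile) {
                    byte[] bytes = buffer.toByteArray();
                    UploadPartResult upload = s3.uploadPart(new UploadPartRequest()
                            .withUploadId(uploadId).withPartNumber(partNumber++)
                            .withBucketName(bucket).withKey(destKey)
                            .withInputStream(new ByteArrayInputStream(bytes))
                            .withPartSize(bytes.length));
                    partETags.add(upload.getPartETag());
                    buffer.reset();
                }
            }
        }

        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, destKey, uploadId, partETags));
    }
}
```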

Finally, there is a bug in the S3 API. The ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved by the final copy operation.
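
A minimal sketch of that final cleanup copy, assuming the merged object was first written to a temporary key (all names are placeholders). A single copyObject call like this works for objects up to 5GB; larger objects would need a multi-part copy instead.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class FinalCopy {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Concatenate into a temporary key first, then copy to the real destination.
        // The copy rewrites the object as a single part, so its ETag is a plain MD5 again.
        s3.copyObject("my-bucket", "tmp/merged-output.csv",
                      "my-bucket", "merged/output.csv");
        s3.deleteObject("my-bucket", "tmp/merged-output.csv");
    }
}
```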

* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read in 5110K to meet the 5MB size needed to continue.
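
A ranged download along those lines might look like the sketch below (AWS SDK for Java v1; bucket and key names are placeholders). It fetches only the first 5110K of part 2, enough to top the 10K of part 1 up to the 5MB minimum.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class RangedGet {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // 5MB minus the 10K already buffered from part 1 = 5110K still needed.
        long needed = 5L * 1024 * 1024 - 10 * 1024;

        GetObjectRequest request = new GetObjectRequest("my-bucket", "staging/part-0002.csv")
                .withRange(0, needed - 1); // byte range is inclusive
        try (S3Object object = s3.getObject(request)) {
            byte[] head = IOUtils.toByteArray(object.getObjectContent());
            // 'head' holds the leading 5110K of part 2; the remainder of part 2 can be
            // attached later with a server-side ranged copy (CopyPartRequest withFirstByte/withLastByte).
        }
    }
}
```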

** You could also keep a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using the byte range 5MB+1 to EOF-1.
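
Since S3 only exposes byte-range copies through the multi-part API, that trimming copy would itself be a one-part multi-part copy, roughly as sketched below (AWS SDK for Java v1; byte offsets here are 0-indexed, all names are placeholders, and the padded object is assumed to fit within the 5GB per-part copy limit).

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

import java.util.Collections;

public class TrimZeroPad {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-bucket";                 // placeholders
        String padded = "tmp/padded-output.csv";     // 5MB of zeros followed by the real data
        String trimmed = "merged/output.csv";

        long pad = 5L * 1024 * 1024;
        long total = s3.getObjectMetadata(bucket, padded).getContentLength();

        // The trim is a one-part multi-part upload whose copy range skips the padding.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, trimmed)).getUploadId();
        CopyPartResult result = s3.copyPart(new CopyPartRequest()
                .withUploadId(uploadId).withPartNumber(1)
                .withSourceBucketName(bucket).withSourceKey(padded)
                .withDestinationBucketName(bucket).withDestinationKey(trimmed)
                .withFirstByte(pad)          // first byte after the zero block (0-indexed)
                .withLastByte(total - 1));   // through the last byte of the object
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, trimmed, uploadId,
                Collections.singletonList(result.getPartETag())));
    }
}
```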

P.S. When I have time to make a Gist of this code I'll post the link here.
