Merging files on AWS S3 (Using Apache Camel)

Question

I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete, these files need to be merged. Currently I am deleting these files and uploading merged files again. This eats up a lot of bandwidth. Is there any way the files can be merged directly on S3?

I am using Apache Camel for routing.

Recommended Answer

S3 allows you to use an S3 file URI as the source for a copy operation. Combined with S3's Multi-Part Upload API, you can supply several S3 object URIs as the source keys for a multi-part upload.
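
The answer does not include code, but a minimal sketch of this approach might look like the following, using the AWS SDK for Java v1 (the bucket and key names are placeholders). It assumes every source object except the last is at least 5MB, which is exactly the caveat discussed next.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CopyPartRequest;
import com.amazonaws.services.s3.model.CopyPartResult;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class S3Concat {

    /** Concatenates the source objects into destKey without downloading or re-uploading them. */
    public static void concatenate(AmazonS3 s3, String bucket,
                                   List<String> sourceKeys, String destKey) {
        // Start a multi-part upload for the merged object.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey)).getUploadId();

        List<PartETag> partETags = new ArrayList<>();
        int partNumber = 1;

        // Each source object becomes one part; S3 copies the bytes server-side.
        for (String sourceKey : sourceKeys) {
            CopyPartResult result = s3.copyPart(new CopyPartRequest()
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber++)
                    .withSourceBucketName(bucket)
                    .withSourceKey(sourceKey)
                    .withDestinationBucketName(bucket)
                    .withDestinationKey(destKey));
            partETags.add(result.getPartETag());
        }

        // Stitch the copied parts together into the final object.
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, destKey, uploadId, partETags));
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        concatenate(s3, "my-bucket",                                   // placeholder bucket
                Arrays.asList("staging/part-0001.csv", "staging/part-0002.csv"),
                "merged/output.csv");                                  // placeholder keys
    }
}
```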

However, the devil is in the details. S3's multi-part upload API has a minimum part size of 5MB. Thus, if any file in the series of files under concatenation is < 5MB, the upload will fail.

However, you can work around this by exploiting the loophole which allows the final upload piece to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).

My production code does this by:

  1. Interrogating the manifest of files to be uploaded
  2. If the first part is under 5MB, download pieces* and buffer to disk until 5MB is buffered.
  3. Append parts sequentially until file concatenation is complete.
  4. If a non-final file is < 5MB, append it, then finish the upload, create a new upload, and continue (a simplified sketch of this loop follows the list).
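
The production code itself is not shown, so the following is only a simplified sketch of the loop described above (AWS SDK for Java v1; MIN_PART, the class name, and the bucket/key names are illustrative). For brevity it buffers undersized runs in memory and uploads each run as a regular part, rather than finishing and restarting the upload as in step 4.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.*;
import com.amazonaws.util.IOUtils;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class S3ConcatWithSmallParts {

    private static final long MIN_PART = 5L * 1024 * 1024; // S3 minimum for every part except the last

    public static void concatenate(AmazonS3 s3, String bucket,
                                   List<String> sourceKeys, String destKey) throws IOException {
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey)).getUploadId();
        List<PartETag> partETags = new ArrayList<>();
        int partNumber = 1;

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int i = 0; i < sourceKeys.size(); i++) {
            String key = sourceKeys.get(i);
            long size = s3.getObjectMetadata(bucket, key).getContentLength();
            boolean lastFile = (i == sourceKeys.size() - 1);

            if (buffer.size() == 0 && size >= MIN_PART) {
                // Large enough on its own: copy it server-side as a single part.
                CopyPartResult copy = s3.copyPart(new CopyPartRequest()
                        .withUploadId(uploadId).withPartNumber(partNumber++)
                        .withSourceBucketName(bucket).withSourceKey(key)
                        .withDestinationBucketName(bucket).withDestinationKey(destKey));
                partETags.add(copy.getPartETag());
            } else {
                // Too small (or continuing a buffered run): download and accumulate.
                try (S3Object obj = s3.getObject(bucket, key)) {
                    buffer.write(IOUtils.toByteArray(obj.getObjectContent()));
                }
                // Flush once the buffer reaches 5MB, or unconditionally for the last file
                // (the final part of a multi-part upload may be smaller than 5MB).
                if (buffer.size() >= MIN_PART || lastFile) {
                    byte[] bytes = buffer.toByteArray();
                    UploadPartResult upload = s3.uploadPart(new UploadPartRequest()
                            .withUploadId(uploadId).withPartNumber(partNumber++)
                            .withBucketName(bucket).withKey(destKey)
                            .withInputStream(new ByteArrayInputStream(bytes))
                            .withPartSize(bytes.length));
                    partETags.add(upload.getPartETag());
                    buffer.reset();
                }
            }
        }

        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, destKey, uploadId, partETags));
    }
}
```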

Finally, there is a bug in the S3 API. The ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved by the final copy operation.
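
A minimal sketch of that final cleanup copy, assuming the merged object was first written to a temporary key (all names are placeholders). A single copyObject call like this works for objects up to 5GB; larger objects would need a multi-part copy instead.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class FinalCopy {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Concatenate into a temporary key first, then copy to the real destination.
        // The copy rewrites the object as a single part, so its ETag is a plain MD5 again.
        s3.copyObject("my-bucket", "tmp/merged-output.csv",
                      "my-bucket", "merged/output.csv");
        s3.deleteObject("my-bucket", "tmp/merged-output.csv");
    }
}
```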

* Note that you can download a byte range of a file. This way, if part 1 is 10K and part 2 is 5GB, you only need to read in 5110K to meet the 5MB size needed to continue.
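
A ranged download along those lines might look like the sketch below (AWS SDK for Java v1; bucket and key names are placeholders). It fetches only the first 5110K of part 2, enough to top the 10K of part 1 up to the 5MB minimum.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class RangedGet {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // 5MB minus the 10K already buffered from part 1 = 5110K still needed.
        long needed = 5L * 1024 * 1024 - 10 * 1024;

        GetObjectRequest request = new GetObjectRequest("my-bucket", "staging/part-0002.csv")
                .withRange(0, needed - 1); // byte range is inclusive
        try (S3Object object = s3.getObject(request)) {
            byte[] head = IOUtils.toByteArray(object.getObjectContent());
            // 'head' holds the leading 5110K of part 2; the remainder of part 2 can be
            // attached later with a server-side ranged copy (CopyPartRequest withFirstByte/withLastByte).
        }
    }
}
```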

** You could also keep a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using the byte range 5MB+1 to EOF-1.
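
Since S3 only exposes byte-range copies through the multi-part API, that trimming copy would itself be a one-part multi-part copy, roughly as sketched below (AWS SDK for Java v1; byte offsets here are 0-indexed, all names are placeholders, and the padded object is assumed to fit within the 5GB per-part copy limit).

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

import java.util.Collections;

public class TrimZeroPad {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-bucket";                 // placeholders
        String padded = "tmp/padded-output.csv";     // 5MB of zeros followed by the real data
        String trimmed = "merged/output.csv";

        long pad = 5L * 1024 * 1024;
        long total = s3.getObjectMetadata(bucket, padded).getContentLength();

        // The trim is a one-part multi-part upload whose copy range skips the padding.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, trimmed)).getUploadId();
        CopyPartResult result = s3.copyPart(new CopyPartRequest()
                .withUploadId(uploadId).withPartNumber(1)
                .withSourceBucketName(bucket).withSourceKey(padded)
                .withDestinationBucketName(bucket).withDestinationKey(trimmed)
                .withFirstByte(pad)          // first byte after the zero block (0-indexed)
                .withLastByte(total - 1));   // through the last byte of the object
        s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                bucket, trimmed, uploadId,
                Collections.singletonList(result.getPartETag())));
    }
}
```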

P.S. When I have time to make a Gist of this code I'll post the link here.
