Can I stream a file upload to S3 without a content-length header?


Problem description

I'm working on a machine with limited memory, and I'd like to upload a dynamically generated (not-from-disk) file in a streaming manner to S3. In other words, I don't know the file size when I start the upload, but I'll know it by the end. Normally a PUT request has a Content-Length header, but perhaps there is a way around this, such as using multipart or chunked content-type.

S3 can support streaming uploads. For example, see here:

http://blog.odonnell.nu/posts/streaming-uploads-s3-python-and-poster/

My question is, can I accomplish the same thing without having to specify the file length at the start of the upload?

Recommended answer

You have to upload your file in 5MiB+ chunks via S3's multipart API. Each of those chunks requires a Content-Length, but you can avoid loading huge amounts of data (100MiB+) into memory. The steps are listed below, followed by a short code sketch.

  • Initiate S3 Multipart Upload.
  • Gather data into a buffer until that buffer reaches S3's lower chunk-size limit (5MiB). Generate MD5 checksum while building up the buffer.
  • Upload that buffer as a Part, store the ETag (read the docs on that one).
  • Once you reach EOF of your data, upload the last chunk (which can be smaller than 5MiB).
  • Finalize the Multipart Upload.
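
As a concrete sketch of these steps in Python using boto3 (the answer itself names no library, so boto3, the chunks iterable, and the bucket/key parameters are assumptions on my part; the MD5 bookkeeping mentioned above is omitted):

import boto3

def stream_to_s3(chunks, bucket, key, part_size=5 * 1024 * 1024):
    """Upload an iterable of byte strings of unknown total length to S3 via
    the multipart API, holding at most roughly one part in memory at a time."""
    s3 = boto3.client('s3')
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']
    parts, buf, part_number = [], b'', 1
    try:
        for chunk in chunks:
            buf += chunk
            # Flush a part as soon as the buffer reaches S3's 5 MiB minimum.
            while len(buf) >= part_size:
                body, buf = buf[:part_size], buf[part_size:]
                resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                      PartNumber=part_number, Body=body)
                parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
                part_number += 1
        if buf:
            # The last part is allowed to be smaller than 5 MiB.
            resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                  PartNumber=part_number, Body=buf)
            parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})
        s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                     MultipartUpload={'Parts': parts})
    except Exception:
        # Abort so the partially uploaded parts don't keep accruing storage costs.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise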

S3 allows up to 10,000 parts. So by choosing a part size of 5MiB you will be able to upload dynamic files of up to 50GiB, which should be enough for most use cases.

However, if you need more, you have to increase your part size, either by using a larger fixed part size (10MiB, for example) or by increasing it during the upload:

First 25 parts:   5MiB (total:  125MiB)
Next 25 parts:   10MiB (total:  375MiB)
Next 25 parts:   25MiB (total:    1GiB)
Next 25 parts:   50MiB (total: 2.25GiB)
After that:     100MiB

This will allow you to upload files of up to 1TB (S3's limit for a single file is 5TB right now) without wasting memory unnecessarily.
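
As an illustration of the escalation schedule above (this helper and its name are my own, not part of the answer), the part size could be derived from the part number like this:

def part_size_for(part_number):
    """Hypothetical schedule: grow the part size as the upload progresses so
    that 10,000 parts cover roughly 1 TB."""
    MiB = 1024 * 1024
    if part_number <= 25:
        return 5 * MiB
    if part_number <= 50:
        return 10 * MiB
    if part_number <= 75:
        return 25 * MiB
    if part_number <= 100:
        return 50 * MiB
    return 100 * MiB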

The problem in the blog post you linked is different from yours: its author knows and uses the Content-Length before the upload. What he wants to improve is that many libraries handle uploads by loading all of the data from a file into memory. In pseudo-code, that would be something like this:

# The entire file is read into memory before the request is sent:
data = File.read(file_name)
request = new S3::PutFileRequest()
request.setHeader('Content-Length', data.size)
request.setBody(data)
request.send()

His solution gets the Content-Length via the filesystem API and then streams the data from disk into the request stream. In pseudo-code:

upload = new S3::PutFileRequestStream()
upload.writeHeader('Content-Length', File.getSize(file_name))
upload.flushHeader()

input = File.open(file_name, File::READONLY_FLAG)

# Stream each chunk straight from disk into the request instead of buffering the whole file:
while (data = input.read())
  upload.write(data)
end

upload.flush()
upload.close()
