Uploading Large File to S3 with Ruby Fails with Out of Memory Error, How to Read and Upload in Chunks?
Problem Description
We are uploading various files to S3 via the Ruby AWS SDK (v2) from a Windows machine. We have tested with Ruby 1.9. Our code works fine except with large files, which throw an out-of-memory error.
At first we were reading the whole file into memory with this code:
:body => IO.binread(filepath),
Then after Googling we found that there were ways to read the file in chunks with Ruby:
:body => File.open(filepath, 'rb') { |io| io.read },
This code did not resolve the issue, though, and we can't find a specific S3 (or related) example showing how the file can be read and passed to S3 in chunks. The whole file is still loaded into memory, and large files still throw an out-of-memory error.
We know we can split the file into chunks and upload them to S3 using the AWS multipart upload, but the preference would be to avoid this if possible (although it's fine if it's the only way).
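For reference, this is roughly what that multipart approach would look like with the v2 Client API, reading only one part into memory at a time. The `PART_SIZE` constant and the `each_part` / `multipart_upload` helpers are just illustrative names, not SDK API; `s3` is an `Aws::S3::Client` as in our code below:

```ruby
# Sketch of a manual multipart upload via the v2 Aws::S3::Client API.
PART_SIZE = 5 * 1024 * 1024 # S3's minimum part size (except the last part)

# Yield the file to the block in part_size chunks, so at most one
# chunk is ever held in memory.
def each_part(filepath, part_size = PART_SIZE)
  File.open(filepath, 'rb') do |io|
    while (chunk = io.read(part_size))
      yield chunk
    end
  end
end

def multipart_upload(s3, bucket, key, filepath, part_size = PART_SIZE)
  upload = s3.create_multipart_upload(:bucket => bucket, :key => key)
  parts  = []
  each_part(filepath, part_size) do |chunk|
    resp = s3.upload_part(
      :bucket      => bucket,
      :key         => key,
      :upload_id   => upload.upload_id,
      :part_number => parts.size + 1,
      :body        => chunk
    )
    parts << { :etag => resp.etag, :part_number => parts.size + 1 }
  end
  s3.complete_multipart_upload(
    :bucket           => bucket,
    :key              => key,
    :upload_id        => upload.upload_id,
    :multipart_upload => { :parts => parts }
  )
rescue => e
  # Abort so S3 does not keep storing (and billing for) orphaned parts
  s3.abort_multipart_upload(:bucket => bucket, :key => key,
                            :upload_id => upload.upload_id) if upload
  raise e
end
```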
Our code sample is below. What is the best way to read the file in chunks, avoiding the out of memory errors, and upload to S3?
require 'aws-sdk'

filepath    = 'c:\path\to\some\large\file.big'
bucket      = 's3-bucket-name'
s3key       = 'some/s3/key/file.big'
accesskeyid = 'ACCESSKEYID'
accesskey   = 'ACCESSKEYHERE'
region      = 'aws-region-here'

s3 = Aws::S3::Client.new(
  :access_key_id     => accesskeyid,
  :secret_access_key => accesskey,
  :region            => region
)

resp = s3.put_object(
  :bucket => bucket,
  :key    => s3key,
  :body   => File.open(filepath, 'rb') { |io| io.read },
)
Note that we are not hitting the S3 5 GB limit; this happens with files of, for example, 1.5 GB.
Answer
The v2 AWS SDK for Ruby, the aws-sdk gem, supports streaming objects directly over the network without loading them into memory. Your example requires only a small change to do this:
File.open(filepath, 'rb') do |file|
  resp = s3.put_object(
    :bucket => bucket,
    :key    => s3key,
    :body   => file
  )
end
This works because it allows the SDK to call #read on the file object, passing in a small number of bytes each time. Calling #read on a Ruby IO object, such as a File, without a length argument reads the entire object into memory and returns it as a string. This is what caused your out-of-memory errors.
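The difference is easy to see with plain Ruby IO, no S3 involved (a small illustrative snippet using a temp file in place of your 1.5 GB file):

```ruby
require 'tempfile'

# #read with no length returns the whole file as one string;
# #read(n) returns at most n bytes per call, and nil at EOF.
file = Tempfile.new('demo')
file.write('abcdefghij' * 3) # 30 bytes standing in for a large file
file.close

# Without a length: one giant string containing the entire file
whole = File.open(file.path, 'rb') { |io| io.read }
puts whole.bytesize # => 30

# With a length: bounded chunks until #read returns nil
File.open(file.path, 'rb') do |io|
  while (chunk = io.read(8))
    puts chunk.bytesize # => 8, 8, 8, 6
  end
end
```

The SDK's streaming upload relies on exactly the second form, calling read with a length repeatedly until it gets nil.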
That said, the aws-sdk gem provides another, more useful interface for uploading files to Amazon S3. This alternative interface automatically:
- Uses the multipart API for large objects
- Can upload parts in parallel using multiple threads, improving upload speed
- Computes the MD5 of the data client-side for a service-side data integrity check
A simple example:
# notice this uses Resource, not Client
s3 = Aws::S3::Resource.new(
  :access_key_id     => accesskeyid,
  :secret_access_key => accesskey,
  :region            => region
)

s3.bucket(bucket).object(s3key).upload_file(filepath)
This is part of the aws-sdk gem's resource interfaces, which contain quite a few helpful utilities. The Client class only provides basic API functionality.