How to download large csv files from S3 without running into 'out of memory' issue?
Problem description
I need to process large files stored in an S3 bucket. I need to divide the csv file into smaller chunks for processing. However, this seems to be a task better done on file-system storage rather than on object storage.
Hence, I am planning to download the large file to local storage, divide it into smaller chunks and then upload the resultant files together into a different folder.
I am aware of the method download_fileobj, but could not determine whether it would result in an out of memory error while downloading large files of sizes ~= 10GB.
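For context, a minimal sketch of what is meant by download_fileobj here, assuming it is handed a file opened on local disk so the object is written out as it is received rather than accumulated in memory (the bucket name, key and local path are placeholders):

import boto3

s3 = boto3.client('s3')

# download_fileobj streams the object into the supplied file-like object;
# with an open file on disk, the data goes to the filesystem as it arrives.
with open('/tmp/large.csv', 'wb') as f:
    s3.download_fileobj('mybucket', 'large.csv', f)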
Recommended answer
I would recommend using download_file():
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
It will not run out of memory while downloading. Boto3 will take care of the transfer process.
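If the next step is the splitting and re-uploading described in the question, a rough sketch could look like the following. It assumes a simple line-based split (no newlines embedded in quoted CSV fields); the bucket name, keys, local paths and chunk size are all placeholders.

import itertools
import boto3

s3 = boto3.client('s3')

# Download the large object to local disk first; download_file streams to disk.
s3.download_file('mybucket', 'big.csv', '/tmp/big.csv')

rows_per_chunk = 1_000_000  # number of data rows per output file

with open('/tmp/big.csv') as src:
    header = next(src)  # keep the header so every chunk stays a valid CSV
    for part, rows in enumerate(
            iter(lambda: list(itertools.islice(src, rows_per_chunk)), [])):
        part_path = f'/tmp/big_part_{part}.csv'
        with open(part_path, 'w') as out:
            out.write(header)
            out.writelines(rows)
        # Upload each piece under a separate prefix ("folder") in the bucket.
        s3.upload_file(part_path, 'mybucket', f'chunks/big_part_{part}.csv')

Since only rows_per_chunk lines are held at a time, the split itself also stays within a bounded memory footprint.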