Split S3 file into smaller files of 1000 lines

Question

I have a text file on S3 with around 300 million lines. I'm looking to split this file into smaller files of 1,000 lines each (with the last file containing the remainder), which I'd then like to put into another folder or bucket on S3.

So far, I've been running this on my local drive using the Linux command:

split -l 1000 file

which splits the original file into smaller files of 1,000 lines. However, with a larger file like this, it seems inefficient to download and then re-upload from my local drive back up to S3.

What would be the most efficient way to split this S3 file, ideally using Python (in a Lambda function) or using other S3 commands? Is it faster to just run this on my local drive?

Answer

Anything that you do will have to download the file, split it, and re-upload it. The only question is where, and whether local disk is involved.

John Rotenstein gave you an example using local disk on an EC2 instance. This has the benefit of running in the AWS datacenters, so it gets a high-speed connection, but has the limitations that (1) you need disk space to store the original file and its pieces, and (2) you need an EC2 instance where you can do this.

One small optimization is to avoid the local copy of the big file, by using a hyphen as the destination of the s3 cp: this will send the output to standard out, and you can then pipe it into split (here I'm also using a hyphen to tell split to read from standard input):

aws s3 cp s3://my-bucket/big-file.txt - | split -l 1000 - output.
for f in output.*; do aws s3 cp "$f" s3://dest-bucket/; done

Again, this requires an EC2 instance to run it on, and the storage space for the output files. split does, however, have a --filter flag that lets you run a shell command for each output file:

aws s3 cp s3://src-bucket/src-file - | split -l 1000 --filter 'aws s3 cp - s3://dst-bucket/result.$FILE' -

So now you've eliminated the issue of local storage, but are left with the issue of where to run it. My recommendation would be AWS Batch, which can spin up an EC2 instance for just the time needed to perform the command.
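
As a rough illustration of that wiring, the sketch below submits such a job from Python with boto3's Batch client. The job name, queue, and job definition are placeholders, and it assumes the registered job definition points at an image that contains both the AWS CLI and coreutils:

import boto3

batch = boto3.client("batch")

# Submit a one-off job that streams the file through split and straight back to S3.
# "split-file-queue" and "split-file-jobdef" are placeholder names; the job definition
# is assumed to use an image with both the AWS CLI and coreutils installed.
batch.submit_job(
    jobName="split-big-file",
    jobQueue="split-file-queue",
    jobDefinition="split-file-jobdef",
    containerOverrides={
        "command": [
            "sh", "-c",
            "aws s3 cp s3://src-bucket/src-file - | "
            "split -l 1000 --filter 'aws s3 cp - s3://dst-bucket/result.$FILE' -",
        ]
    },
)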

You can, of course, write a Python script to do this on Lambda, and that would have the benefit of being triggered automatically when the source file has been uploaded to S3. I'm not that familiar with the Python SDK (boto), but it appears that get_object will return the original file's body as a stream of bytes, which you can then iterate over as lines, accumulating however many lines you want into each output file.
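
As a minimal sketch of that idea (the bucket names, key, and output prefix below are placeholders, and it relies on the iter_lines() helper that botocore's StreamingBody provides; a real Lambda would also need a timeout and memory budget sized for a 300-million-line file):

import boto3

s3 = boto3.client("s3")

# Placeholder names -- adjust to your own buckets and keys.
SRC_BUCKET, SRC_KEY = "src-bucket", "big-file.txt"
DST_BUCKET, LINES_PER_FILE = "dst-bucket", 1000

def _flush(lines, part):
    # Each 1,000-line chunk becomes its own object in the destination bucket.
    s3.put_object(
        Bucket=DST_BUCKET,
        Key=f"pieces/part-{part:06d}.txt",
        Body=b"\n".join(lines) + b"\n",
    )

def lambda_handler(event, context):
    # get_object returns the body as a StreamingBody; iter_lines() walks it
    # line by line without loading the whole file into memory.
    body = s3.get_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["Body"]

    chunk, part = [], 0
    for line in body.iter_lines():
        chunk.append(line)
        if len(chunk) == LINES_PER_FILE:
            _flush(chunk, part)
            chunk, part = [], part + 1
    if chunk:  # the last file holds the remainder
        _flush(chunk, part)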
