Write 100 million files to S3

Problem Description

My main aim is to split records into files according to the ID of each record. There are over 15 billion records right now, and that number will certainly grow. I need a scalable solution using Amazon EMR. I have already done this for a smaller dataset of around 900 million records.

Input files are in CSV format, and one of the fields needs to become the file name in the output. So say there are the following input records:

awesomeId1, somedetail1, somedetail2
awesomeID1, somedetail3, somedetail4
awesomeID2, somedetail5, somedetail6

Now two files should be produced as output, one named awesomeID1.dat and the other awesomeID2.dat, each containing the records for its respective ID.
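
For illustration, the map step for this kind of split can simply emit the first CSV field as the key and the rest of the record as the value; the per-key output format discussed below then chooses which file each record lands in. This is only a sketch using the old org.apache.hadoop.mapred API (to match the output format the question names), and the class name IdKeyMapper is invented here:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper: key each CSV record by its leading ID field
public class IdKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String record = line.toString();
        int comma = record.indexOf(',');
        if (comma < 0) {
            return;                                    // skip malformed lines
        }
        String id = record.substring(0, comma).trim();
        String details = record.substring(comma + 1);  // everything after the ID
        output.collect(new Text(id), new Text(details));
    }
}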

Size of the input: 600 GB in total (size of the gzipped files) per month, with each file around 2-3 GB. I need to process around 6 months or more at a time, so the total data size would be 6 * 600 GB (compressed).

Previously I was getting a "Too many open files" error when I used FileByKeyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> to write to S3 according to the ID value. Then, as I have explained here, instead of writing every file directly to S3, I wrote them locally and moved them to S3 in batches of 1024 files.
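
The FileByKeyTextOutputFormat itself is not shown in the question, but a subclass of MultipleTextOutputFormat that routes records into per-ID files typically just overrides generateFileNameForKeyValue. The following is a minimal reconstruction under that assumption, not the asker's actual code:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Illustrative reconstruction: send each key's records to a file named <key>.dat
public class FileByKeyTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // e.g. key "awesomeID1" -> output file "awesomeID1.dat"
        return key.toString() + ".dat";
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Optional: drop the key from the file contents, since it is already the file name
        return null;
    }
}

With one output file per distinct ID, this is exactly where the "Too many open files" pressure comes from, and, once files are pushed straight to S3, the request-rate pressure as well.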

But now, with the increased amount of data, I am getting the following message from S3, after which it skips writing the file in question: "Please reduce your request rate." Also, I have to run on a cluster of 200 m1.xlarge machines, which then takes around 2 hours, so it is very costly too!

I would like a scalable solution that will not fail if the amount of data increases again in the future.

Any Suggestions?

Solution

Here is some info on SlowDown errors: https://forums.aws.amazon.com/message.jspa?messageID=89722#89816
You should insert into S3 in alphabetical order. Also the limit is dynamic and re-adjusts over time, so slow down and try to increase your rate later.
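
One concrete way to act on "slow down and try to increase your rate later" is to wrap each S3 put in an exponential backoff loop. The sketch below is an assumption about how that could look with the v1 AWS SDK for Java; the retry count and delays are illustrative values, not taken from the question or the answer:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;

// Illustrative uploader that backs off when S3 answers 503 / SlowDown
public class BackoffUploader {
    private final AmazonS3 s3 = new AmazonS3Client(); // default credentials chain

    public void putWithBackoff(String bucket, String key, File file) throws InterruptedException {
        long delayMs = 200;                            // initial delay, tune as needed
        for (int attempt = 0; attempt < 8; attempt++) {
            try {
                s3.putObject(bucket, key, file);
                return;                                // success
            } catch (AmazonS3Exception e) {
                if (e.getStatusCode() == 503 || "SlowDown".equals(e.getErrorCode())) {
                    Thread.sleep(delayMs);             // wait, then retry more slowly
                    delayMs *= 2;
                } else {
                    throw e;                           // other errors are not retried here
                }
            }
        }
        throw new IllegalStateException("Giving up on " + key + " after repeated SlowDown responses");
    }
}

Backing off trades some throughput for fewer rejected requests; it does not raise the underlying S3 request-rate limit.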

Perhaps you are better off using a database than a filesystem? How big is the total dataset?

DynamoDB may be a good fit, but may be expensive at $1/GB/month. (Since it uses SSD for backing storage.)
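
As a rough idea of the "database instead of a filesystem" suggestion, each record could be written as a DynamoDB item keyed by the ID that would otherwise have been the file name. This is a hedged sketch with the v1 AWS SDK for Java; the table name "records" and the attribute names are invented for the example:

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

// Illustrative write of one input record as a DynamoDB item
public class DynamoWriter {
    public static void main(String[] args) {
        AmazonDynamoDBClient ddb = new AmazonDynamoDBClient(); // default credentials chain

        Map<String, AttributeValue> item = new HashMap<String, AttributeValue>();
        item.put("recordId", new AttributeValue("awesomeID1"));             // hash key
        item.put("details", new AttributeValue("somedetail1,somedetail2")); // rest of the record

        ddb.putItem(new PutItemRequest("records", item));
    }
}

In practice a range key would also be needed, since many records share the same ID; a read would then fetch all items for an ID instead of one .dat file.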

RDS is another option. Its pricing is from $0.10/GB/month.

Even better may be to host your own NoSQL or other datastore on EC2, such as on the new hs1.8xlarge instance. You can launch it only when you need it, and back it up to S3 when you don't.
