s3fs on Amazon EMR: Will it scale for approx 100 million small files?


Problem Description

Please refer to the following questions that were already asked: Write 100 million files to s3 and Too many open files in EMR

The size of the data being handled here is at least around 4-5TB uncompressed. To be precise - 300GB with gzip compression.

The size of the input will grow gradually, since this step aggregates data over time.

For example, the logs till December 2012 will contain:

UDID-1, DateTime, Lat, Lng, Location
UDID-2, DateTime, Lat, Lng, Location
UDID-3, DateTime, Lat, Lng, Location
UDID-1, DateTime, Lat, Lng, Location

For this we would have to generate a separate file per UDID (unique device identifier), using the UDID as the filename and holding that UDID's records in sorted order.

Ex:

UDID-1.dat => File Contents
DateTime1, Lat1, Lng1, Location1
DateTime2, Lat2, Lng2, Location2
DateTime3, Lat3, Lng3, Location3

Now, when the logs for January 2013 arrive, this step will read both the old data - the per-UDID files already generated for earlier months by this step - and the new logs, to aggregate each UDID's data.

Ex:

If the January logs contain a record such as UDID-1, DateTime4, Lat4, Lng4, Location4, the file UDID-1.dat would need to be updated with this data. Each UDID's file should remain chronologically sorted.
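For illustration only, here is a minimal sketch of that per-file update as a shell step. It assumes the DateTime field sorts lexically (e.g. ISO 8601 timestamps) and that the new January records for this UDID have already been extracted into a separate file; the name new-jan-UDID-1.txt is hypothetical:

# Merge the new month's records into the existing per-UDID file and keep it
# sorted chronologically; the sort key is the first comma-separated field (DateTime).
cat UDID-1.dat new-jan-UDID-1.txt | sort -t',' -k1,1 > UDID-1.dat.tmp
mv UDID-1.dat.tmp UDID-1.dat

In practice this merge would happen inside the MR job rather than file by file on a shell, but it shows the intended end state of each per-UDID file.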

For this step, we thought of writing the data to an EBS volume and keeping it as-is for later use. But EBS volumes have a limit of 1TB. As already mentioned in the referenced questions, generating the files directly on s3, or generating them on HDFS and then moving them to s3, is not a viable option for this use case, because there are around 100 million small files that need to be moved. And moving that many files is far too slow, even with s3distcp.

So, next we are going to try s3fs - a FUSE-based file system backed by Amazon S3. Does anybody know how well s3fs scales? Will it be able to handle 100 million small files? How long will it take to move 3-5TB of data, spread across 100 million files, from s3 to the local filesystem so that it can be used by the MR job? And how long will it take to move the data back to s3? Will it run into the same problems we faced with s3distcp?

Thanks in advance!

Solution

I would recommend against using s3fs to copy large numbers of small files.

I've tried on a few occasions to move large numbers of small files out of HDFS, and the s3fs daemon kept crashing. I was using both cp and rsync. It gets even more aggravating if you are doing incremental updates. One alternative is to use the use_cache option and see how it behaves.
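If you do want to try that, the cache is enabled at mount time with the use_cache option. This is only a sketch under assumptions - the bucket name, mount point, and cache directory below are placeholders, and your S3 credentials still have to be configured separately (e.g. via a passwd file):

# Mount the bucket with a local disk cache for transferred objects.
# The cache directory must have enough free space for the files you touch.
mkdir -p /mnt/s3bucket /mnt/s3cache
s3fs bucketname /mnt/s3bucket -o use_cache=/mnt/s3cache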

We have resorted to using s3cmd and iterating through the files one at a time, for example with the Unix find command. Something like this:

find <hdfs fuse mounted dir> -type f -exec s3cmd put {} s3://bucketname \;
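Since that runs one upload at a time, a variation you could try (untested by me at this scale) is to fan the same s3cmd put calls out over several parallel processes with xargs:

# Same idea, but with 8 s3cmd uploads running in parallel.
find <hdfs fuse mounted dir> -type f -print0 | xargs -0 -P8 -I{} s3cmd put {} s3://bucketname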

You can also try s3cmd sync with something like this:

s3cmd sync /<local-dir>/ s3://bucketname
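Roughly speaking, sync only uploads files that are new or have changed, which maps well onto your monthly incremental updates. If I remember the flag correctly, you can preview what it would transfer with a dry run:

# Show what sync would upload without actually transferring anything.
s3cmd sync --dry-run /<local-dir>/ s3://bucketname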
