HDFS behavior on lots of small files and 128 MB block size


Problem description



I have lots (up to hundreds of thousands) of small files, each 10-100 KB. My HDFS block size is 128 MB, and my replication factor is 1.

Are there any drawbacks to allocating an HDFS block per small file?

I've seen pretty contradictory answers:

1. An answer saying that the smallest file takes up a whole block
2. An answer saying that HDFS is clever enough, and a small file will take up small_file_size + 300 bytes of metadata

I ran a test like the one in this answer, and it confirms that the second option is correct: HDFS doesn't allocate a whole block for small files.

But what about a batch read of 10,000 small files from HDFS? Will it slow down because of the 10,000 blocks and their metadata? Is there any reason to keep multiple small files within a single block?

Update: my use case

I have only one use case for the small files, ranging from 1,000 up to 500,000 of them. I compute those files once, store them, and then read them all at once.

1) As I understand it, the NameNode space problem is not a problem for me. 500,000 is an absolute maximum; I will never have more. If each small file takes 150 bytes on the NN, then the absolute maximum for me is 71.52 MB, which is acceptable.

2) Does Apache Spark eliminate the MapReduce problem? Would sequence files or HAR help me solve the issue? As I understand it, Spark shouldn't depend on Hadoop MR, but it's still too slow: 490 files take 38 seconds to read, and 3,420 files take 266 seconds.

    // Read all small parquet files and reduce the number of partitions.
    Dataset<SmallFileWrapper> smallFiles = sparkSession
        .read()
        .parquet(pathsToSmallFilesCollection)
        .as(Encoders.kryo(SmallFileWrapper.class))
        .coalesce(numPartitions);
    

Solution

As you have noticed already, an HDFS file does not take up any more space than it needs, but there are other drawbacks to having small files in an HDFS cluster. Let's first go through the problems without taking batching into consideration:

1. NameNode (NN) memory consumption. I am not aware of how Hadoop 3 (which is currently under development) handles this, but in previous versions the NN is a single point of failure (you can add a secondary NN, but in the end it does not replace or enhance the primary NN). The NN is responsible for maintaining the file-system structure in memory and on disk, and it has limited resources. Each file-system object maintained by the NN is believed to take about 150 bytes (check this blog post). More files = more RAM consumed by the NN.
2. The MapReduce paradigm (and as far as I know, Spark suffers from the same symptoms). In Hadoop, Mappers are allocated per split (which by default corresponds to a block), which means that for every small file you have out there, a new Mapper needs to be started to process its contents. The problem is that for small files it actually takes Hadoop much longer to start a Mapper than to process the file content. Basically, your system will be doing the unnecessary work of starting and stopping Mappers instead of actually processing the data. This is the reason Hadoop processes one 128 MB file (with a 128 MB block size) much faster than 128 files of 1 MB each (with the same block size). One common mitigation is to pack many small files into each split, as sketched right after this list.
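
As a rough illustration of that mitigation (my own sketch, not part of the original answer), a MapReduce job can use CombineTextInputFormat so that a single Mapper processes many small files per split. The job name, the pass-through Mapper, and the 128 MB cap below are just example values:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFilesJob {

        // Trivial pass-through Mapper; replace with the real per-file logic.
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSmallFilesJob.class);

            // Pack many small files into each split so one Mapper handles many files.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at roughly one HDFS block (128 MB here).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }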

Now, if we talk about batching, you have a few options out there: HAR, sequence files, Avro schemas, etc. The precise answer to your questions depends on the use case. Let's assume you do not want to merge files; in that case you might use HAR files (or any other solution featuring efficient archiving and indexing). The NN problem is then solved, but the number of Mappers will still be equal to the number of splits. If merging files into larger ones is an option, you can use sequence files, which basically aggregate small files into bigger ones, solving both problems to some extent. In both scenarios, though, you cannot really update or delete the information directly the way you could with individual small files, so more sophisticated mechanisms are required for managing those structures.
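
To make the sequence-file route concrete, here is a minimal sketch of my own (not from the original answer) that packs a directory of small files into one block-compressed SequenceFile, keyed by the original path so individual files can still be identified; the input and output paths are placeholders. (For the HAR route, the `hadoop archive` command-line tool plays the analogous packing role.)

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]);  // directory holding the small files
            Path output = new Path(args[1]);    // e.g. /data/packed.seq

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(output),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDirectory()) {
                        continue;
                    }
                    // Files are 10-100 KB, so reading each one fully into memory is fine.
                    byte[] content = new byte[(int) status.getLen()];
                    try (InputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // Key = original file path, value = raw file bytes.
                    writer.append(new Text(status.getPath().toString()),
                                  new BytesWritable(content));
                }
            }
        }
    }

Reading the packed file back is the mirror image with SequenceFile.Reader, and both MapReduce and Spark can then consume one large input instead of thousands of tiny ones.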

In general, if the main reason for maintaining many small files is the attempt to make reads fast, I would suggest taking a look at different systems such as HBase, which were created for fast data access rather than batch processing.
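
For comparison, random per-file access in HBase could look roughly like the following sketch (my own illustration, not part of the answer). It assumes a pre-created `small_files` table with a column family `f`, and stores each file's bytes in a single cell keyed by the file name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileStore {
        private static final byte[] FAMILY = Bytes.toBytes("f");       // assumed column family
        private static final byte[] QUALIFIER = Bytes.toBytes("data");

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("small_files"))) {

                // Write one small file's bytes under its name as the row key.
                byte[] content = Bytes.toBytes("file contents go here");
                Put put = new Put(Bytes.toBytes("reports/2017/file-0001"));
                put.addColumn(FAMILY, QUALIFIER, content);
                table.put(put);

                // Random read of a single file by key, with no per-file HDFS block involved.
                Result result = table.get(new Get(Bytes.toBytes("reports/2017/file-0001")));
                byte[] stored = result.getValue(FAMILY, QUALIFIER);
                System.out.println("read " + stored.length + " bytes");
            }
        }
    }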

