HDFS - Block Size Related


Problem description

I have files of only 10 MB in size. I think that in HDFS the first file consumes 10 MB and the remaining 54 MB is freed up and added to the available space.
My question is -


  1. Would the second 10 MB file (or the next sequence of 10 MB files) keep being added to this block until it becomes 64 MB?
    For example, if in total we consume two blocks of 64 MB each and 20 MB of a third block, will the input split then give 3 outputs: two of 64 MB and one of 20 MB? Is that true?
Solution

With reference from Hadoop - The Definitive Guide:

    HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode. (Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.)



So you are right that in HDFS the first file consumes 10 MB and the remaining 54 MB is freed up and added to the available space.

However, HDFS blocks are not a physical storage allocation unit but a logical one, so it is not the case that data keeps being added to a block until it reaches 64 MB (or whatever the block size is): the freed-up disk space is simply returned to the available storage.
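One way to see that the block size is only a logical upper bound is to compare a file's length and consumed space with its configured block size through the Hadoop FileSystem API. The sketch below is only an illustration, not part of the original answer; the path /data/sample-10mb.bin and the class name are made up for the example:

    // Minimal sketch: compare a file's real size with its block size in HDFS.
    // Assumes a 10 MB file at the hypothetical path /data/sample-10mb.bin.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample-10mb.bin"); // hypothetical 10 MB file
            FileStatus status = fs.getFileStatus(file);

            // getLen() is the actual file length; getBlockSize() is only the per-block upper bound.
            System.out.println("File length (bytes): " + status.getLen());
            System.out.println("Block size  (bytes): " + status.getBlockSize());

            // Disk space actually consumed on the datanodes (length x replication),
            // not blockSize x number of blocks.
            System.out.println("Space consumed (bytes): "
                    + fs.getContentSummary(file).getSpaceConsumed());
        }
    }

For a 10 MB file on a cluster with a 64 MB block size and replication factor 3, this would report a length of about 10 MB and roughly 30 MB of consumed space, never 64 MB or 192 MB.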

The number of mappers depends on the number of input splits, and the Job Client computes the input splits over the data located in the HDFS input path specified when the job is run. So, as per your example, it will create 3 input splits: two of 64 MB and one of 20 MB (assuming the default 64 MB HDFS block size).
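To make the split arithmetic explicit, here is a small self-contained sketch of the simple case where the split size equals the block size (a simplification of FileInputFormat's default behaviour; the class name and numbers are invented for this illustration):

    // Minimal sketch of how a single file is cut into input splits when the
    // split size equals the HDFS block size (ignoring FileInputFormat's 10% slop).
    import java.util.ArrayList;
    import java.util.List;

    public class SplitCount {
        static final long MB = 1024L * 1024L;

        // Returns the sizes of the input splits for one file of length fileSize.
        static List<Long> splitSizes(long fileSize, long splitSize) {
            List<Long> splits = new ArrayList<>();
            for (long remaining = fileSize; remaining > 0; remaining -= splitSize) {
                splits.add(Math.min(splitSize, remaining));
            }
            return splits;
        }

        public static void main(String[] args) {
            // 148 MB of input (2 x 64 MB + 20 MB) with the default 64 MB block size:
            // prints [67108864, 67108864, 20971520], i.e. three splits.
            System.out.println(splitSizes(148 * MB, 64 * MB));
        }
    }

Each of those splits is then handed to one map task, which is why this example ends up with three mappers.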

