HDFS block size vs. actual file size


Problem description

I know that HDFS stores data using the regular Linux file system on the data nodes. My HDFS block size is 128 MB. Say I have 10 GB of disk space in my Hadoop cluster; that means HDFS initially has 80 blocks of available storage.



If I create a small file of, say, 12.8 MB, the number of available HDFS blocks becomes 79. What happens if I create another small file of 12.8 MB? Will the number of available blocks stay at 79, or will it drop to 78? In the former case, HDFS would essentially recalculate the number of available blocks after each allocation based on the free disk space, so the count would only drop to 78 once more than 128 MB of disk space has actually been consumed. Please clarify.
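A quick way to watch how HDFS accounts for this (my own hedged suggestion, not part of the original question) is to compare the NameNode capacity figures before and after writing a small file; on a 1.x cluster something like the following works:

• hadoop dfsadmin -report          # prints Configured Capacity, DFS Used and DFS Remaining per datanode
• hadoop fs -put ./small_file /tmp/small_file    # small_file stands for any hypothetical ~12.8 MB test file
• hadoop dfsadmin -report          # run again and compare DFS Used / DFS Remaining with the first report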

Solution

The best way to know is to try it; see my results below.

But before trying, my guess is that even though you can only allocate 80 full blocks in your configuration, you can allocate more than 80 non-empty files. This is because I think HDFS does not use a full block each time you allocate a non-empty file. Put another way, an HDFS block is not a storage allocation unit but a replication unit. I think the storage allocation unit of HDFS is the block of the underlying file system: if you use ext4 with a 4 KB block size and you create a 1 KB file in a cluster with a replication factor of 3, you consume 3 × 4 KB = 12 KB of hard disk space.
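This guess can be checked directly on a datanode by looking at the block files themselves. A minimal sketch, assuming dfs.data.dir is at its 1.x default of /tmp/hadoop-${USER}/dfs/data (adjust the path to your own configuration):

• du -sh --apparent-size /tmp/hadoop-${USER}/dfs/data/current/blk_*   # logical size of each block file, roughly the size of the file content
• du -sh /tmp/hadoop-${USER}/dfs/data/current/blk_*                   # space actually allocated, rounded up to the underlying 4 KB ext4 block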



Enough guessing and thinking, let's try it. My lab configuration is as follows:


• hadoop version 1.0.4
• 4 data nodes, each with a little less than 5.0 GB of available space, ext4 block size of 4 KB
• block size of 64 MB, default replication of 1 (a minimal hdfs-site.xml for these two settings is sketched after this list)
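For reference, the block size and replication used above would be set in hdfs-site.xml roughly like this (a sketch using the standard Hadoop 1.x property names; the original answer does not show its configuration files):

<?xml version="1.0"?>
<configuration>
  <!-- 64 MB HDFS block size -->
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
  <!-- keep a single replica of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>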


After starting HDFS, I have the following NameNode summary:

• 1 files and directories, 0 blocks = 1 total
• DFS Used: 112 KB
• DFS Remaining: 19.82 GB


Then I run the following commands (a note on creating the 1K_file follows the list):


• hadoop fs -mkdir /test
• for f in $(seq 1 10); do hadoop fs -copyFromLocal ./1K_file /test/$f; done
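The 1K_file itself is not shown in the answer; any 1 KB file will do. A hypothetical way to create it and to sanity-check the uploads (these commands are my own additions, not from the original):

• dd if=/dev/zero of=./1K_file bs=1024 count=1    # create a 1 KB test file
• hadoop fs -ls /test                             # should list the ten 1 KB files named 1 through 10
• hadoop fs -stat "block size %o, replication %r" /test/1    # HDFS block size and replication factor recorded for one of the files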



With these results:


• 12 files and directories, 10 blocks = 22 total
• DFS Used: 122.15 KB
• DFS Remaining: 19.82 GB



So the 10 files did not consume 10 times 64 MB ("DFS Remaining" did not change).
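To see the same thing per block rather than from the NameNode summary, fsck can also be used (a follow-up check of my own, not from the original answer):

• hadoop fsck /test -files -blocks   # lists each of the 10 files with a single block of about 1024 B, confirming blocks are sized by their content rather than padded out to 64 MB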



