HDFS block size vs. actual file size
Question
I know that HDFS stores data using the regular Linux file system on the data nodes. My HDFS block size is 128 MB. Let's say I have 10 GB of disk space in my Hadoop cluster; that means HDFS initially has 80 blocks of available storage.

If I create a small file of, say, 12.8 MB, the number of available HDFS blocks becomes 79. What happens if I create another small file of 12.8 MB? Will the available block count stay at 79 or drop to 78? In the former case, HDFS would essentially recalculate the number of available blocks after each allocation based on the remaining free disk space, so the count would only drop to 78 once more than 128 MB of disk space has actually been consumed. Please clarify.
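For reference, the figures I am reasoning about can be read directly off the cluster with the standard Hadoop reporting commands (the directory /mydir below is only a placeholder):

hadoop dfsadmin -report                # per-DataNode capacity, "DFS Used" and "DFS Remaining"
hadoop fs -du /mydir                   # logical size of the files, not the on-disk block usage
hadoop fsck /mydir -files -blocks      # the blocks actually allocated, with their real lengths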
The best way to know is to try it; see my results below.
But before trying, my guess is that even if you can only allocate 80 full blocks in your configuration, you can still create more than 80 non-empty files. This is because I think HDFS does not use a full block every time you allocate a non-empty file. Put another way, an HDFS block is not a storage allocation unit but a replication unit. I think the storage allocation unit of HDFS is the block of the underlying filesystem: if you use ext4 with a block size of 4 KB and you create a 1 KB file in a cluster with a replication factor of 3, you consume 3 × 4 KB = 12 KB of hard disk space.
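A quick way to test that claim is to look at the physical block files on one of the data nodes; the path below is a placeholder for wherever dfs.data.dir points on that node:

ls -lh /path/to/dfs/data/current/blk_*   # each HDFS block is a plain local file blk_<id> (plus a small .meta checksum file)
du -k /path/to/dfs/data/current/blk_*    # space actually used on the local ext4 filesystem, rounded up to its 4 KB blocks

If the replication-unit view is right, a 1 KB HDFS file should show up here as a roughly 1 KB block file occupying one 4 KB ext4 block, not as a 64 MB allocation.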
Enough guessing and thinking, let's try it. My lab configuration is as follow:
- hadoop version 1.0.4
- 4 data nodes, each with a little less than 5.0G of available space, ext4 block size of 4K
- block size of 64 MB, default replication of 1
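Once a test file exists in HDFS, the two non-default settings can be double-checked from the client with hadoop fs -stat, where %o prints the block size in bytes and %r the replication factor (here /test/1 is one of the files copied in below):

hadoop fs -stat %o /test/1   # expected: 67108864, i.e. 64 MB
hadoop fs -stat %r /test/1   # expected: 1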
After starting HDFS, I have the following NameNode summary:
- 1 files and directories, 0 blocks = 1 total
- DFS Used: 112 KB
- DFS Remaining: 19.82 GB
Then I run the following commands:
hadoop fs -mkdir /test
for f in $(seq 1 10); do hadoop fs -copyFromLocal ./1K_file /test/$f; done
With these results:
- 12 files and directories, 10 blocks = 22 total
- DFS Used: 122.15 KB
- DFS Remaining: 19.82 GB
So the 10 files did not consume 10 × 64 MB of storage ("DFS Remaining" did not change).
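To see where those 10 blocks actually went, fsck can list each file with its blocks and their real lengths (block IDs and the exact byte counts will differ from one cluster to another):

hadoop fsck /test -files -blocks   # each 1 KB file is reported with a single block of roughly 1024 bytes

In other words, a block only holds the bytes written to it; the 64 MB block size is an upper bound per block, not a pre-allocated chunk of disk.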