No. of files vs. no. of blocks in HDFS
Problem description
I am running a single-node Hadoop environment. When I ran

$ hadoop fsck /user/root/mydatadir -block

I was really confused by the output it gave:
Status: HEALTHY
 Total size: 998562090 B
 Total dirs: 1
 Total files: 50 (Files currently being written: 1)
 Total blocks (validated): 36 (avg. block size 27737835 B) (Total open file blocks (not validated): 1)
 Minimally replicated blocks: 36 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 36 (100.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 1.0
 Corrupt blocks: 0
 Missing replicas: 72 (200.0 %)
 Number of data-nodes: 1
 Number of racks: 1
It says I have written 50 files, and yet it only uses 36 blocks (I ignore the file currently being written). From my understanding, each file uses at least 1 block even though its size is less than the HDFS block size (for me it is 64 MB, the default). That is, I expect 50 blocks for 50 files. What is wrong with my understanding?

Recommended answer

The files do not require a full block each. The concern is the overhead of managing them, as well as, if you truly have many of them, namenode utilization.

From Hadoop - The Definitive Guide:

small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.

Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.

However, a single block only contains a single file, unless a specialized input format such as HAR, SequenceFile, or CombineFileInputFormat is used. Here is some more information: Small File problem info
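The disk-space point in the quote can be sanity-checked with a little arithmetic. A minimal sketch (the helper names and the 1 MB / 128 MB sizes are illustrative, taken from the quoted example rather than from the fsck report above):

```python
import math

def blocks_used(file_size: int, block_size: int) -> int:
    """Number of HDFS block entries a file occupies (0 for an empty file)."""
    return math.ceil(file_size / block_size)

def disk_used(file_size: int) -> int:
    """HDFS stores only the raw bytes: a partial last block is not padded."""
    return file_size

one_mb = 1 * 1024 * 1024
print(blocks_used(one_mb, 128 * 1024 * 1024))  # 1 block entry
print(disk_used(one_mb))                        # 1048576 bytes, not 134217728
```

So a small file costs one block *entry* in the namenode's metadata, but only its actual size on the datanode's disk.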
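To make the namenode-utilization point concrete: the namenode keeps every file, directory, and block as an in-memory object, roughly 150 bytes each as a rule of thumb from The Definitive Guide. A hypothetical back-of-the-envelope sketch (the helper name and the flat 150-byte figure are illustrative assumptions, not measurements):

```python
PER_OBJECT_BYTES = 150  # rough rule-of-thumb per-object namenode memory cost

def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate namenode heap used: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * PER_OBJECT_BYTES

# Ten million small files (one block each) need roughly 3 GB of namenode heap,
# regardless of how little data each file actually holds.
print(namenode_bytes(10_000_000))  # 3000000000 bytes, about 3 GB
```

This is why packing many small files into HAR or SequenceFile containers matters even though the files themselves waste no disk space.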