No. of files vs. no. of blocks in HDFS
Problem description
I am running a single-node Hadoop environment. When I ran

$ hadoop fsck /user/root/mydatadir -block

I was really confused by the output it gave:
Status: HEALTHY
 Total size: 998562090 B
 Total dirs: 1
 Total files: 50 (Files currently being written: 1)
 Total blocks (validated): 36 (avg. block size 27737835 B) (Total open file blocks (not validated): 1)
 Minimally replicated blocks: 36 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 36 (100.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 1.0
 Corrupt blocks: 0
 Missing replicas: 72 (200.0 %)
 Number of data-nodes: 1
 Number of racks: 1
It says I have written 50 files, and yet it only uses 36 blocks (I ignore the file currently being written). From my understanding, each file uses at least 1 block even though its size is less than the HDFS block size (for me it is 64 MB, the default). That is, I expect 50 blocks for 50 files. What is wrong with my understanding?

Recommended answer

The files do not require a full block each. The concern is the overhead of managing them, as well as, if you truly have many of them, namenode utilization.

From Hadoop - The Definitive Guide:

small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.

Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.

However, a single block only contains a single file, unless a specialized input format such as HAR, SequenceFile, or CombineFileInputFormat is used. Here is some more information: Small File problem info
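The disk-space point in the quote can be sanity-checked with a little arithmetic. A minimal sketch (the helper names and the 1 MB / 128 MB sizes are illustrative, taken from the quoted example rather than from the fsck report above):

```python
import math

def blocks_used(file_size: int, block_size: int) -> int:
    """Number of HDFS block entries a file occupies (0 for an empty file)."""
    return math.ceil(file_size / block_size)

def disk_used(file_size: int) -> int:
    """HDFS stores only the raw bytes: a partial last block is not padded."""
    return file_size

one_mb = 1 * 1024 * 1024
print(blocks_used(one_mb, 128 * 1024 * 1024))  # 1 block entry
print(disk_used(one_mb))                        # 1048576 bytes, not 134217728
```

So a small file costs one block *entry* in the namenode's metadata, but only its actual size on the datanode's disk.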
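To make the namenode-utilization point concrete: the namenode keeps every file, directory, and block as an in-memory object, roughly 150 bytes each as a rule of thumb from The Definitive Guide. A hypothetical back-of-the-envelope sketch (the helper name and the flat 150-byte figure are illustrative assumptions, not measurements):

```python
PER_OBJECT_BYTES = 150  # rough rule-of-thumb per-object namenode memory cost

def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Approximate namenode heap used: one object per file plus one per block."""
    return num_files * (1 + blocks_per_file) * PER_OBJECT_BYTES

# Ten million small files (one block each) need roughly 3 GB of namenode heap,
# regardless of how little data each file actually holds.
print(namenode_bytes(10_000_000))  # 3000000000 bytes, about 3 GB
```

This is why packing many small files into HAR or SequenceFile containers matters even though the files themselves waste no disk space.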