What exactly does "Non DFS Used" mean?


Problem description

This is what I saw on the Web UI recently:

 Configured Capacity     :   232.5 GB
 DFS Used    :   112.44 GB
 Non DFS Used    :   119.46 GB
 DFS Remaining   :   613.88 MB
 DFS Used%   :   48.36 %
 DFS Remaining%  :   0.26 %

and I was confused that Non DFS Used takes up more than half of the capacity,

which I thought meant half of the Hadoop storage was being wasted.

After searching fruitlessly, I just formatted the namenode and started from scratch.

Then I copied one huge text file (about 19 GB) from local to HDFS (successfully).

Now the UI says:

Configured Capacity  :   232.5 GB
DFS Used     :   38.52 GB
Non DFS Used     :   45.35 GB
DFS Remaining    :   148.62 GB
DFS Used%    :   16.57 %
DFS Remaining%   :   63.92 %

Before copying, DFS Used and Non DFS Used were both 0.

Because DFS Used is approximately double the original text file's size, and I configured a replication factor of 2,

I guess that DFS Used consists of the two replicas of the original file plus metadata.
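That guess can be sanity-checked with a back-of-the-envelope calculation (the 19 GB figure is approximate, so the result will not match the reported 38.52 GB exactly):

```shell
# Rough check: file size x replication factor, values from this question
file_gb=19        # approximate size of the copied text file
replication=2     # configured replication factor
echo "expected DFS Used ~ $(( file_gb * replication )) GB"
```

This prints an expected DFS Used of about 38 GB, close to the 38.52 GB shown in the UI.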

But I still have no idea where Non DFS Used comes from, and why it takes up so much more capacity than DFS Used.

What happened? Did I make a mistake?

Answer

"Non DFS Used" is calculated by the following formula:

Non DFS Used = Configured Capacity - Remaining Space - DFS Used

It is still confusing, at least to me,

because Configured Capacity = Total Disk Space - Reserved Space.

So Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used

Let's take an example. Assume I have a 100 GB disk, and I set the reserved space (dfs.datanode.du.reserved) to 30 GB.

On the disk, the system and other files use 40 GB and DFS uses 10 GB. If you run df -h, you will see the available space is 50 GB for that disk volume.

In the HDFS web UI, it will show:

Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB

So it actually means: you initially configured 30 GB to be reserved for non-DFS usage and 70 GB for HDFS. However, non-DFS usage exceeds the 30 GB reservation and eats up 10 GB of space that should belong to HDFS!

The term "Non DFS Used" should really be renamed to something like "how much configured DFS capacity is occupied by non-DFS use",

and one should stop trying to figure out why non-DFS use is so high inside Hadoop.

One useful command is lsof | grep delete, which will help you identify open files that have been deleted. Sometimes Hadoop processes (like hive, yarn, mapred, and hdfs) may hold references to such already-deleted files, and these references occupy disk space.

Also, du -hsx * | sort -rh | head -10 helps list the ten largest folders.
