Why do we need to format HDFS every time we restart the machine?
Problem description
I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.
I have changed the path where Hadoop stores its data (by default Hadoop stores data in the /tmp folder). The hdfs-site.xml file looks as below:
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the DataNode never starts. I confirmed that the DataNode does not start by checking the logs and by using the jps command.
Then I
- Stopped the cluster using the stop-all.sh script.
- Formatted HDFS using the hadoop namenode -format command.
- Started the cluster using the start-all.sh script.
Now everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and try to start the cluster.
- Has anyone encountered a similar problem?
- Why is this happening, and
- How can we solve this problem?
Recommended answer

By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive a reboot. However, there is more to HDFS than just blocks: the NameNode's metadata also lives under /tmp by default, and since Ubuntu typically clears /tmp on reboot, the NameNode cannot come up again until you reformat it (which in turn discards the references to your existing blocks). You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell which other dirs you have to change, that depends on your config, but the namenode dir is mandatory and may also be sufficient).
I would also recommend using a more recent Hadoop distribution. BTW, the 1.1 namenode dir setting is dfs.name.dir.
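Below is a minimal sketch of what hdfs-site.xml could look like for a 1.x pseudo-distributed setup; /HADOOP_CLUSTER_DATA/name is only an example path for the NameNode directory (it does not come from the original question), so adjust it to your own layout and make sure the hadoop user can write to it:
<!-- NameNode metadata directory; must live outside /tmp to survive reboots (example path) -->
<property>
<name>dfs.name.dir</name>
<value>/HADOOP_CLUSTER_DATA/name</value>
</property>
<!-- DataNode block storage, as already configured in the question -->
<property>
<name>dfs.data.dir</name>
<value>/HADOOP_CLUSTER_DATA/data</value>
</property>
After pointing dfs.name.dir at the new, empty directory, run hadoop namenode -format one last time so the NameNode initializes it; after that, restarting the machine should no longer require a format.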