How do I correctly remove nodes in Hadoop?


Problem description

I'm running Hadoop 1.1.2 on a cluster with 10+ machines. I would like to nicely scale up and down, both for HDFS and MapReduce. By "nicely", I mean that I require that no data be lost (allow HDFS nodes to decommission) and that nodes running a task finish it before shutting down.

I've noticed the datanode process dies once decommissioning is done, which is good. This is what I do to remove a node (a combined sketch of these steps follows the list):

  • Add node to mapred.exclude
  • Add node to hdfs.exclude
  • $ hadoop mradmin -refreshNodes
  • $ hadoop dfsadmin -refreshNodes
  • $ hadoop-daemon.sh stop tasktracker
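
For reference, here is a minimal shell sketch of those steps as run from the master node. The file names mapred.exclude and hdfs.exclude under $HADOOP_CONF_DIR, and the hostname worker05.example.com, are assumptions for the example; use whatever paths your mapred.hosts.exclude and dfs.hosts.exclude properties actually point to.

  # Decommission a slave node "nicely" (sketch; adjust paths and hostnames)
  NODE=worker05.example.com                            # hypothetical hostname
  echo "$NODE" >> "$HADOOP_CONF_DIR/mapred.exclude"    # exclude from MapReduce
  echo "$NODE" >> "$HADOOP_CONF_DIR/hdfs.exclude"      # exclude from HDFS
  hadoop mradmin -refreshNodes                         # JobTracker rereads its exclude file
  hadoop dfsadmin -refreshNodes                        # NameNode starts decommissioning
  # once the NameNode log / web UI reports the decommission as complete:
  ssh "$NODE" hadoop-daemon.sh stop tasktracker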

To add the node back in (assuming it was removed like above), this is what I'm doing (a matching sketch follows the list).

  • Remove from mapred.exclude
  • Remove from hdfs.exclude
  • $ hadoop mradmin -refreshNodes
  • $ hadoop dfsadmin -refreshNodes
  • $ hadoop-daemon.sh start tasktracker
  • $ hadoop-daemon.sh start datanode
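
The mirror image for re-adding the node, again only a sketch with the same assumed file names and hostname:

  # Recommission the node (sketch)
  NODE=worker05.example.com                                # hypothetical hostname
  sed -i "/^$NODE\$/d" "$HADOOP_CONF_DIR/mapred.exclude"   # remove from the MapReduce exclude list
  sed -i "/^$NODE\$/d" "$HADOOP_CONF_DIR/hdfs.exclude"     # remove from the HDFS exclude list
  hadoop mradmin -refreshNodes
  hadoop dfsadmin -refreshNodes
  ssh "$NODE" hadoop-daemon.sh start tasktracker
  ssh "$NODE" hadoop-daemon.sh start datanode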

Is this the correct way to scale up and down "nicely"? When scaling down, I'm noticing that job duration rises sharply for certain unlucky jobs (since the tasks they had running on the removed node need to be re-scheduled).

Solution

If you have not set a dfs exclude file before, follow steps 1-3. Otherwise start from step 4. A configuration and command sketch follows the list.

  1. Shut down the NameNode.
  2. Set dfs.hosts.exclude to point to an empty exclude file.
  3. Restart NameNode.
  4. In the dfs exclude file, specify the nodes using the full hostname or IP or IP:port format.
  5. Do the same in mapred.exclude
  6. Execute bin/hadoop dfsadmin -refreshNodes. This forces the NameNode to reread the exclude file and start the decommissioning process.
  7. Execute bin/hadoop mradmin -refreshNodes.
  8. Monitor the NameNode and JobTracker web UI and confirm the decommission process is in progress. It can take a few seconds to update. Messages like "Decommission complete for node XXXX.XXXX.X.XX:XXXXX" will appear in the NameNode log files when it finishes decommissioning, at which point you can remove the nodes from the cluster.
  9. When the process has completed, the NameNode UI will list the datanode as decommissioned. The JobTracker page will show the updated number of active nodes. Run bin/hadoop dfsadmin -report to verify. Stop the datanode and tasktracker processes on the excluded node(s).
  10. If you do not plan to reintroduce the machine to the cluster, remove it from the include and exclude files.
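
As a concrete illustration of steps 2 and 4-9, here is a sketch of the configuration and commands involved. The property names dfs.hosts.exclude (hdfs-site.xml) and mapred.hosts.exclude (mapred-site.xml) are the standard Hadoop 1.x settings; the exclude-file paths, hostname, and log location are assumptions for the example.

  # Step 2 (one-time): in hdfs-site.xml, point dfs.hosts.exclude at the exclude file, e.g.
  #   <property><name>dfs.hosts.exclude</name><value>/etc/hadoop/conf/hdfs.exclude</value></property>
  # and analogously mapred.hosts.exclude in mapred-site.xml, then restart the NameNode (steps 1-3).

  # Steps 4-5: list the node(s) to retire, one per line (full hostname, IP, or IP:port)
  echo "worker05.example.com" >> /etc/hadoop/conf/hdfs.exclude     # assumed path
  echo "worker05.example.com" >> /etc/hadoop/conf/mapred.exclude   # assumed path

  # Steps 6-7: make the NameNode and JobTracker reread the exclude files
  bin/hadoop dfsadmin -refreshNodes
  bin/hadoop mradmin -refreshNodes

  # Step 8: watch the NameNode log (assumed location) for the completion message
  tail -f "$HADOOP_LOG_DIR"/hadoop-*-namenode-*.log | grep -i decommission

  # Step 9: verify, then stop the daemons on the excluded node
  bin/hadoop dfsadmin -report | grep "Decommission Status"
  ssh worker05.example.com hadoop-daemon.sh stop tasktracker
  ssh worker05.example.com hadoop-daemon.sh stop datanode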

To add a node as a datanode and tasktracker, see the Hadoop FAQ page.

EDIT: When a live node is to be removed from the cluster, what happens to the job?

The jobs running on a node to be decommissioned would get affected, as the tasks of the job scheduled on that node would be marked as KILLED_UNCLEAN (for map and reduce tasks) or KILLED (for job setup and cleanup tasks). See line 4633 in JobTracker.java for details. The job will be informed to fail that task. Most of the time, the JobTracker will reschedule execution. However, after many repeated failures it may instead decide to allow the entire job to fail or succeed. See line 2957 onwards in JobInProgress.java.

