Why are Map task outputs written to the local disk and not to HDFS?


Question

I am prepping for an exam and here is a question in the lecture notes:

Why are Map task outputs written to the local disk and not to HDFS?

Here are my thoughts:

  • It reduces network traffic: the reducer may run on the same machine as the map output, in which case no copying is required.
  • The fault tolerance of HDFS is not needed. If the job dies halfway, we can always just re-run the map task.

What are other possible reasons? Are my answers reasonable?

Solution

Your reasoning is correct.
However, I would like to add a few points by asking: what if map outputs were written to HDFS?
Writing to HDFS is not like writing to the local disk. It is a more involved process: the namenode ensures that at least dfs.replication.min copies are written (the property is named dfs.namenode.replication.min in Hadoop 2 and later), and it also runs a background thread to make additional copies of under-replicated blocks.
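
To put rough numbers on that replication overhead, here is a back-of-the-envelope sketch. It is not Hadoop code: `write_cost` is a made-up helper, the replication factor of 3 is an assumed setting, and the model assumes the first replica lands on the node that writes it, so only the remaining replicas cross the network. Spills, compression, and the shuffle to the reducers are ignored, since the shuffle happens either way.

```python
def write_cost(map_output_bytes: int, replication: int = 1) -> dict:
    """Rough cost of persisting one map task's output.

    replication=1 models a plain local-disk write; replication>1 models an
    HDFS write. Assumption: the first replica lands on the writer's own node,
    so only the remaining replicas travel over the network.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return {
        "disk_bytes": map_output_bytes * replication,
        "network_bytes": map_output_bytes * (replication - 1),
    }


GIB = 1024 ** 3
local = write_cost(GIB)                # local disk: one copy, no network
hdfs = write_cost(GIB, replication=3)  # HDFS with an assumed factor of 3

print(local)
print(hdfs)
```

Under these assumptions, 1 GiB of map output costs 1 GiB of disk and no network on the local disk, versus 3 GiB of disk and 2 GiB of network traffic on HDFS, all for data that is thrown away once the job finishes.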
Suppose the user kills the job partway through, or the job simply fails. Lots of intermediate files would be left sitting on HDFS for no reason, and you would have to delete them manually. If this happens too many times, your cluster's performance will degrade. HDFS is optimized for appending, not for frequent deletion.
Also, if the job fails during the map phase, it performs a cleanup before exiting. If the output were on HDFS, deletion would require the namenode to send a block deletion message to the appropriate datanodes, which invalidates the block and removes it from blocksMap. That is a lot of work just to clean up after a failure, and for no gain!
