Hadoop file write


Problem description

Referring to Tom White's book, Hadoop: The Definitive Guide. My question (assuming a replication factor of 3 and data being written to nodes D1, D2, D3): if I understand correctly, when writing to the first location D1 itself fails, the whole process is restarted with a new pipeline. But what if writing to the second node D2 fails? The book says that "any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets" and that "the current block on the good datanodes is given a new identity". I am not clear on the following points (a minimal sketch of such a write follows the list below):

1. The block getting a new identity
2. Who gives this new identity?
3. Why is it needed?
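
For context, here is a minimal sketch of the kind of write the question describes, using the standard org.apache.hadoop.fs.FileSystem API. The path, buffer size, and block size below are illustrative assumptions, not values from the book:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS points at the cluster's NameNode,
            // e.g. hdfs://namenode:8020 (hypothetical address).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/example.txt"); // hypothetical path
            short replication = 3;                    // the factor assumed in the question
            long blockSize = 128L * 1024 * 1024;      // 128 MB, a common default

            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
            out.writeBytes("hello hdfs\n");
            // close() returns only after the D1 -> D2 -> D3 pipeline has
            // acknowledged all packets.
            out.close();
            fs.close();
        }
    }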

Solution

To answer your question, I would like to highlight one point first: both read and write operations are initiated by the client (the HDFS client).

Have a look at this diagram (the client/NameNode/DataNode interaction figure; the image is not reproduced here).

In the entire process, the client reads from or writes to the datanodes directly, not through the NameNode. The NameNode just sends the client the list of datanodes to contact for the read or write operation.
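
This division of labour is visible from ordinary client code: the metadata call below is answered by the NameNode, and the returned list names the DataNodes the client then talks to directly. A minimal sketch using the standard FileSystem API (the path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/example.txt"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // This metadata query goes to the NameNode; the block contents
            // themselves are streamed to/from the DataNodes it lists.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }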

Coming back to your query,

      "any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets"

Right after this line, you can find the following:

The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
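
The recovery steps in that passage can be sketched in pseudocode. This is purely illustrative: the Packet class, the queues, and the printed "identity" below are hypothetical stand-ins, not the real DFSOutputStream internals; only the three numbered steps mirror the book:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.List;

    public class PipelineRecoverySketch {
        static class Packet {
            final int seq;
            Packet(int seq) { this.seq = seq; }
            public String toString() { return "packet-" + seq; }
        }

        static long identity = 1; // stands in for the namenode issuing block identities

        static void handleDatanodeFailure(Deque<Packet> ackQueue, Deque<Packet> dataQueue,
                                          List<String> pipeline, String failedNode) {
            // 1. Unacknowledged packets go back to the front of the data
            //    queue, oldest first, so downstream datanodes miss nothing.
            while (!ackQueue.isEmpty()) {
                dataQueue.addFirst(ackQueue.removeLast());
            }
            // 2. The current block on the good datanodes is given a new
            //    identity, which is communicated to the namenode (simulated).
            identity++;
            System.out.println("block re-identified as " + identity);
            // 3. The failed datanode is removed and a new pipeline is built
            //    from the remaining good datanodes.
            pipeline.remove(failedNode);
            System.out.println("new pipeline: " + pipeline);
        }

        public static void main(String[] args) {
            Deque<Packet> ackQueue = new ArrayDeque<>(Arrays.asList(new Packet(1), new Packet(2)));
            Deque<Packet> dataQueue = new ArrayDeque<>(Arrays.asList(new Packet(3)));
            List<String> pipeline = new ArrayList<>(Arrays.asList("D1", "D2", "D3"));
            handleDatanodeFailure(ackQueue, dataQueue, pipeline, "D2");
            System.out.println("data queue now: " + dataQueue); // [packet-1, packet-2, packet-3]
        }
    }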

The above passage answers your first query: 1. The block getting a new identity.

2. Who gives this new identity: even though the book is not explicit about it, we can conclude that the HDFS client is responsible for providing the new identity and informing the NameNode about it.

3. Why is it needed?

Since only partial data was written to the problematic datanode, that partial block has to be removed completely. The same is explained in the next lines of the book:

The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on.
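
As a hypothetical sketch of why the new identity helps (HDFS realises this idea with block generation stamps; the method below is illustrative, not a real API): when the failed datanode recovers, its partial replica still carries the old identity, fails the comparison, and can be deleted.

    public class StaleReplicaCheck {
        // A replica whose identity is older than the one the namenode now
        // expects is a leftover partial block and can be deleted.
        static boolean isStale(long replicaIdentity, long expectedIdentity) {
            return replicaIdentity < expectedIdentity;
        }

        public static void main(String[] args) {
            long before = 7; // identity before the failure (illustrative values)
            long after = 8;  // identity issued during pipeline recovery
            System.out.println(isStale(before, after)); // true: delete partial block
        }
    }

That is all the new identity buys: a cheap way to tell the complete replicas apart from the stale partial one.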
