Referring to Tom White's book Hadoop: The Definitive Guide.
My question (assuming replication factor 3 and data being written to nodes D1, D2, D3):
If I understand correctly, when writing to the first node D1 fails, the whole process is restarted with a new pipeline.
What if writing to the second node D2 fails? The book says that "any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets" and that the current block on the good datanodes is given a new identity.
I am not clear on these points:
- The block getting a new identity
- Who gives this new identity?
- Why is it needed?
Solution
To answer your question, I would like to highlight one point: both read and write operations are initiated by the client (the HDFS Client).
Have a look at this diagram.
In the entire process, the client reads/writes from/to the datanodes directly, not through the NameNode. The NameNode just sends the client the list of datanodes to contact for the read or write operation.
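The division of labor above can be sketched with a small simulation. All class and method names here (`NameNode`, `DataNode`, `allocate_pipeline`, etc.) are illustrative stand-ins, not the real Hadoop APIs: the point is that the NameNode only hands out the datanode list, and the client streams the data itself.

```python
# Simplified sketch of the HDFS write path: the client asks the
# NameNode only for the list of datanodes, then writes the packets
# to those datanodes directly. Hypothetical names, not Hadoop APIs.

class NameNode:
    """Knows the cluster topology; never touches file data."""
    def __init__(self, datanodes):
        self.datanodes = datanodes

    def allocate_pipeline(self, replication=3):
        # Return the datanodes the client should write to.
        return self.datanodes[:replication]

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_packet(self, block_id, packet):
        self.blocks.setdefault(block_id, []).append(packet)

def client_write(namenode, block_id, packets):
    # The client writes directly to each datanode in the pipeline;
    # the NameNode only supplied the list.
    pipeline = namenode.allocate_pipeline()
    for packet in packets:
        for dn in pipeline:
            dn.write_packet(block_id, packet)
    return pipeline

datanodes = [DataNode(n) for n in ("D1", "D2", "D3")]
nn = NameNode(datanodes)
pipeline = client_write(nn, "blk_1", [b"p1", b"p2"])
print([dn.name for dn in pipeline])  # ['D1', 'D2', 'D3']
```

Note that no file data ever passes through the `NameNode` object; it only returns the pipeline.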
Coming back to your query,
"any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets"
Right after this line, you can find the following:
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes.
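The recovery steps the book describes can be sketched as follows. This is a minimal simulation, assuming D2 is the failed node; the `_gen2` suffix is a made-up stand-in for the new block identity (in real HDFS this role is played by a new generation stamp), and the queue/pipeline structures are illustrative only.

```python
from collections import deque

# Hypothetical sketch of pipeline recovery: un-acknowledged packets
# go back to the front of the data queue, the block gets a new
# identity, and the failed node is dropped from the pipeline.

def recover_pipeline(data_queue, ack_queue, pipeline, failed_node, block_id):
    # 1. Packets still awaiting acknowledgement are re-queued at the
    #    front of the data queue, so downstream datanodes miss nothing.
    while ack_queue:
        data_queue.appendleft(ack_queue.pop())
    # 2. The current block is given a new identity, which would be
    #    communicated to the NameNode.
    new_block_id = block_id + "_gen2"
    # 3. The failed datanode is removed; the new pipeline consists of
    #    the remaining good datanodes.
    new_pipeline = [dn for dn in pipeline if dn != failed_node]
    return data_queue, new_block_id, new_pipeline

data_queue = deque([b"p4", b"p5"])
ack_queue = deque([b"p2", b"p3"])        # sent but not yet acknowledged
dq, new_id, pipe = recover_pipeline(
    data_queue, ack_queue, ["D1", "D2", "D3"], "D2", "blk_1")
print(list(dq))   # [b'p2', b'p3', b'p4', b'p5']
print(pipe)       # ['D1', 'D3']
```

Note how the re-queued packets p2 and p3 end up ahead of p4 and p5, preserving the original send order, and the pipeline continues with just the two good datanodes.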
The quoted passage above answers your first query: 1. The block getting a new identity.
2. Who gives this new identity: even though it is not explicit, we can conclude that the HDFS Client is responsible for providing the new identity and informing the NameNode about it.
3. Why is it needed? Since only partial data was written on the problematic datanode, we have to remove that block of data completely. The same is explained in the next lines of the book:
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on.
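To see why the new identity makes the cleanup possible, here is a minimal sketch. The block IDs and the `reconcile` helper are hypothetical: the idea is simply that when the failed datanode recovers, its partial block still carries the old identity, so it no longer matches anything the NameNode considers valid and can be deleted.

```python
# Why the new identity matters: after recovery, the stale partial
# block is recognizable by its old identity. IDs are made up for
# illustration; real HDFS uses generation stamps for this check.

valid_blocks = {"blk_1_gen2"}            # the NameNode knows only the new identity

def reconcile(datanode_blocks):
    # Blocks whose identity the NameNode no longer recognizes are
    # scheduled for deletion.
    return [b for b in datanode_blocks if b not in valid_blocks]

recovered_d2 = ["blk_1_gen1"]            # stale partial block from before the failure
print(reconcile(recovered_d2))  # ['blk_1_gen1'] -> to be deleted
```

Without the identity change, the partial block on the recovered datanode would be indistinguishable from a valid replica.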