Hadoop 2.0 data write operation acknowledgement


Problem description




I have a small query regarding Hadoop data writes.

From the Apache documentation:

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure;

In the image below, when is the write acknowledgement treated as successful?

1) After writing the data to the first datanode?

2) After writing the data to the first datanode + the two other datanodes?

I am asking this because I have heard two conflicting statements in YouTube videos. One video stated that the write is successful once the data is written to one datanode, while the other stated that the acknowledgement is sent only after the data has been written to all three nodes.

Solution

Step 1: The client creates the file by calling the create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it.

The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and an IOException is thrown to the client. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
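For concreteness, a minimal client-side write using the public Hadoop FileSystem API looks roughly like the sketch below (the cluster URI and file path are placeholders); fs.create() is what drives the DistributedFileSystem / DFSOutputStream machinery described in these steps.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address, for illustration only.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/write-ack-demo.txt");
            // Steps 1-2: create() records the new file with the namenode
            // (no blocks yet) and returns an FSDataOutputStream.
            try (FSDataOutputStream out = fs.create(file, true)) {
                // Step 3 onwards: writes are buffered, split into packets and
                // pushed through the datanode pipeline by the DataStreamer.
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            } // close() corresponds to steps 6-7 below.
        }
    }
}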

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
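As a rough mental model of the data queue (a simplified sketch only, not the real DFSOutputStream code), think of a producer/consumer queue of fixed-size packets: the client's write() calls enqueue packets, and a streamer thread dequeues them and sends them to the first datanode in the pipeline.

import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy illustration of the data-queue idea; names and sizes are illustrative.
public class DataQueueSketch {
    static final int PACKET_SIZE = 64 * 1024;   // HDFS write packets default to 64 KB
    static final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();

    // Client write path: chop the user's bytes into packets and enqueue them.
    static void write(byte[] userData) throws InterruptedException {
        for (int off = 0; off < userData.length; off += PACKET_SIZE) {
            int end = Math.min(off + PACKET_SIZE, userData.length);
            dataQueue.put(Arrays.copyOfRange(userData, off, end));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the DataStreamer: drain the queue and "send" each packet.
        Thread streamer = new Thread(() -> {
            try {
                while (true) {
                    byte[] packet = dataQueue.take();
                    System.out.println("streaming packet of " + packet.length + " bytes");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        streamer.setDaemon(true);
        streamer.start();

        write(new byte[200 * 1024]);  // ~200 KB of data -> 4 packets
        Thread.sleep(500);            // give the streamer time to drain the queue
    }
}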

Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.

Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
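This step is what resolves the original question: the client treats a packet as successfully written only after every datanode in the pipeline (all three replicas, with the default replication factor) has acknowledged it, not after the first datanode alone. A toy model of that rule, under the simplifying assumption that acks arrive one per datanode:

import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the ack-queue rule; the real pipeline aggregates acks upstream.
public class AckQueueSketch {
    static class Packet {
        final long seqno;
        int acksReceived;                        // acks so far for this packet
        Packet(long seqno) { this.seqno = seqno; }
    }

    static final int PIPELINE_SIZE = 3;          // replication factor 3 -> 3 datanodes
    static final Queue<Packet> ackQueue = new ArrayDeque<>();

    // Called once per datanode acknowledgement for the packet at the head of the queue.
    static void onDatanodeAck() {
        Packet head = ackQueue.peek();
        if (head == null) return;
        head.acksReceived++;
        if (head.acksReceived == PIPELINE_SIZE) {
            ackQueue.remove();                   // only now is the packet considered written
            System.out.println("packet " + head.seqno + " acknowledged by all datanodes");
        }
    }

    public static void main(String[] args) {
        ackQueue.add(new Packet(1));
        onDatanodeAck();   // first datanode  -> still waiting
        onDatanodeAck();   // second datanode -> still waiting
        onDatanodeAck();   // third datanode  -> removed from the ack queue
    }
}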

Step 6: When the client has finished writing data, it calls close() on the stream.

Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for the blocks to be minimally replicated before returning successfully.
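Two related details, stated as general Hadoop 2.x behaviour rather than anything specific to the steps above: an application that needs visibility or durability guarantees before close() can call hflush() or hsync() on the FSDataOutputStream, and the "minimally replicated" threshold the namenode waits for is the dfs.namenode.replication.min setting (default 1). A short sketch, with the path again a placeholder:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "Minimally replicated" in step 7 means this many replicas (default 1);
        // normally configured cluster-wide in hdfs-site.xml, set here only to illustrate.
        conf.setInt("dfs.namenode.replication.min", 1);

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo.txt"), true)) {
            out.write("important record\n".getBytes(StandardCharsets.UTF_8));
            // hflush(): data is visible to new readers once it reaches all datanodes
            // in the pipeline; hsync() additionally forces it onto the datanodes' disks.
            out.hflush();
        } // close() then completes the file with the namenode as described in step 7.
    }
}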

