Hadoop: HDFS File Writes & Reads


Problem Description




I have a basic question regarding file writes and reads in HDFS.

For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 DataNodes. My understanding is that, for each block, the client first writes the block to the first DataNode in the pipeline, which then forwards it to the second, and so on. Once the third DataNode successfully receives the block, it sends an acknowledgement back to DataNode 2, and finally to the client through DataNode 1. Only after receiving the acknowledgement for the block is the write considered successful, and only then does the client proceed to write the next block.

If this is the case, then isn't the time taken to write each block more than in a traditional file write, because of:

1. the replication factor (default is 3), and
2. the write process happening sequentially, block after block?

Please correct me if my understanding is wrong. Also, please comment on the following questions:

1. My understanding is that file reads/writes in Hadoop have no parallelism, and the best they can do is match a traditional file read or write (i.e. as if replication were set to 1), plus some overhead from the distributed communication mechanism.
2. Parallelism is provided only during the data-processing phase via MapReduce, not while a client reads or writes a file.
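
For reference, this is roughly the client-side code I have in mind: a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API (the path /tmp/example.txt and the payload are placeholders I made up). The client only sees a byte stream; the block splitting and the replication pipeline described above happen underneath.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsWriteRead {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/example.txt"); // hypothetical path

            // Write: the client streams bytes; HDFS splits them into blocks
            // and replicates each block through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches each block directly from one of the
            // DataNodes holding a replica of it.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }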

Solution

Though your explanation of a file write above is correct, a DataNode can read and write data simultaneously. From the HDFS Architecture Guide:

    a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline

A write operation takes more time than on a traditional file system (due to bandwidth issues and general overhead), but not as much as 3x (assuming a replication factor of 3).
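
To make the "less than 3x" point concrete, here is a back-of-envelope model (a sketch under assumed numbers, not measured HDFS behaviour): the client streams a block as a sequence of small packets, and with pipelining each packet can be forwarded to the next DataNode while the following packet is still arriving.

    // Back-of-envelope comparison of pipelined vs. purely sequential replication.
    // All numbers are illustrative assumptions: a 128 MB block, 64 KB packets,
    // and an equal per-hop transfer time for every packet.
    public class PipelineModel {
        public static void main(String[] args) {
            long blockBytes = 128L * 1024 * 1024;
            long packetBytes = 64L * 1024;
            long packets = blockBytes / packetBytes;  // 2048 packets
            int replicas = 3;                         // pipeline depth

            // Sequential: every packet crosses all 3 hops one after another.
            long sequentialHops = packets * replicas;

            // Pipelined: after (replicas - 1) steps to fill the pipeline,
            // one packet completes per hop-time.
            long pipelinedHops = packets + (replicas - 1);

            System.out.printf("sequential: %d hop-times, pipelined: %d hop-times (%.2fx)%n",
                    sequentialHops, pipelinedHops, (double) sequentialHops / pipelinedHops);
        }
    }

Under these assumptions the pipelined 3-replica write costs about 2050 hop-times versus 2048 for a single-replica write, while a fully sequential 3-replica write would cost 6144. This simplistic model ignores shared bandwidth and per-packet overhead, but it shows why the observed slowdown is much closer to 1x than to the replication factor.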
