Why is HDFS write once and read multiple times?


Problem description



I am a new learner of Hadoop. While reading about Apache HDFS I learned that HDFS is a write-once file system. Some other distributions (Cloudera) provide an append feature. It would be good to know the rationale behind this design decision. In my humble opinion, this design places a lot of limitations on Hadoop and makes it suitable only for a limited set of problems (problems similar to log analytics).
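
To make the write-once semantics concrete, here is a minimal sketch against the standard `org.apache.hadoop.fs.FileSystem` API. The cluster URI and file path are placeholders, and `append()` is only available on versions and distributions that support it:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally set in core-site.xml; the URI here is a placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/logs/events.log");  // hypothetical path

        // A file is written exactly once: create, write, close.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first record\n");
        }

        // Once closed, the bytes are immutable: there is no API for
        // overwriting bytes in place, and create(path, false) on an
        // existing path throws an exception.

        // Where supported, append() only adds bytes at the end:
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
        }
        fs.close();
    }
}
```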



Expert comments would help me understand HDFS better.

Solution

There are three major reasons that HDFS has the design it has:

  • HDFS was designed by slavishly copying the design of Google's GFS, which was intended to support batch computation only


  • HDFS was not originally intended for anything but batch computation

  • Designing a real distributed file system that can support high-performance batch operations as well as real-time file modification is difficult, and was beyond the budget and experience level of HDFS's original implementors.




There is no inherent reason that Hadoop couldn't have been built as a fully read/write file system; MapR FS is proof of that. But implementing such a thing was far outside the scope and capabilities of the original Hadoop project, and the architectural decisions in the original design of HDFS essentially preclude lifting this limitation. A key factor is the presence of the NameNode: HDFS requires that all metadata operations, such as file creation, deletion, or file length extension, round-trip through the NameNode. MapR FS avoids this by eliminating the NameNode entirely and distributing metadata throughout the cluster.
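
To make that round-trip concrete, the following is a deliberately simplified toy model, not the real HDFS API (the actual interface is the ClientProtocol RPC between the client and the NameNode; all names below are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the NameNode as the single owner of the namespace: every
// metadata operation must round-trip through this one component before
// any data moves to or from a DataNode.
class ToyNameNode {
    private final Map<String, List<Long>> namespace = new HashMap<>();
    private long nextBlockId = 0;

    // File creation: a pure metadata operation.
    synchronized long create(String path) {
        namespace.put(path, new ArrayList<>());
        return allocateBlock(path);
    }

    // "File length extension": the client must come back here for every
    // new block before it may stream bytes to DataNodes.
    synchronized long allocateBlock(String path) {
        long id = nextBlockId++;
        namespace.get(path).add(id);
        return id;
    }

    // Deletion: metadata-only; DataNodes reclaim the blocks later.
    synchronized void delete(String path) {
        namespace.remove(path);
    }
}

public class NameNodeRoundTrips {
    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        long b0 = nn.create("/logs/events.log");        // round-trip 1
        long b1 = nn.allocateBlock("/logs/events.log"); // round-trip 2: grow the file
        nn.delete("/logs/events.log");                  // round-trip 3
        System.out.println("allocated blocks " + b0 + " and " + b1);
    }
}
```

Because every create, grow, and delete serializes through that one process, supporting fine-grained random writes would multiply NameNode traffic and state, which is why distributing the metadata, as MapR FS does, removes the choke point.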



Over time, not having a truly mutable file system has become more and more annoying, as the workload for Hadoop-related systems such as Spark and Flink has moved increasingly toward operational, near-real-time, or real-time use. The responses to this problem have included:


  • MapR FS. As mentioned above, MapR implemented a fully functional, high-performance re-implementation of HDFS that includes POSIX functionality as well as NoSQL table and streaming APIs. This system has been in operation for years at some of the largest big-data systems around.

  • Kudu. Cloudera essentially gave up on implementing viable mutation on top of HDFS and announced Kudu, with no timeline for general availability. Kudu implements table-like structures rather than fully general mutable files.

  • Apache Nifi and the commercial version HDF. Hortonworks has also largely given up on HDFS and announced a strategy of forking applications into batch silos (supported by HDFS) and streaming silos (supported by HDF).

  • Isilon. EMC implemented the HDFS wire protocol as part of its Isilon product line. This allows a Hadoop cluster to have two storage silos: one for large-scale, high-performance, cost-effective batch processing based on HDFS, and one for medium-scale mutable file access via Isilon (see the sketch after this list).

  • Other. There have been a number of essentially defunct efforts to remedy the write-once nature of HDFS, including KFS (the Kosmix file system) and others. None of these have seen significant adoption.
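
As a small illustration of the Isilon point above: because Isilon speaks the HDFS wire protocol, an ordinary Hadoop client can be pointed at it by changing only the file system URI. The hostname and port below are placeholders; in practice this setting lives in core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WireProtocolClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the default file system at the Isilon endpoint instead
        // of a NameNode; the host and port are hypothetical.
        conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

        // From here on, unmodified HDFS client code works as usual.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```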



