How gzip file gets stored in HDFS
Problem description
HDFS supports storing files in compressed formats. I know that gzip compression doesn't support splitting. Now imagine the file is a gzip-compressed file whose compressed size is 1 GB. My question is:
- How will this file be stored in HDFS (block size is 64 MB)?
From this link I came to know that the gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. But I couldn't understand it completely and am looking for a broader explanation.
Further questions about the gzip-compressed file:
- How many blocks will there be for this 1 GB gzip-compressed file?
- Will it go on multiple datanodes?
- How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?
- What is the DEFLATE algorithm?
- Which algorithm is applied while reading the gzip-compressed file?
I am looking for a broad and detailed explanation here.
How will this file be stored in HDFS (block size is 64 MB) if splitting is not supported for the gzip file format?
HDFS stores a gzip file like any other file: it splits it into block-sized chunks regardless of the compression format. A 1 GB file with a 64 MB block size occupies 16 DFS blocks, and those blocks are distributed across the datanodes of the cluster. What gzip's lack of splittability affects is processing, not storage: a single mapper must read all 16 blocks sequentially, because a gzip stream cannot be decompressed starting from an arbitrary block boundary.
How many blocks will there be for this 1 GB gzip-compressed file?
1 GB / 64 MB = 15.625, rounded up to 16 DFS blocks (the last block is only partially filled).
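The block-count arithmetic above can be sketched as a ceiling division (the sizes are the illustrative values from the question, using decimal units as the answer does):

```java
public class BlockCount {
    // Number of HDFS blocks needed for a file: ceiling division,
    // since the last block may be only partially filled.
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 1_000_000_000L; // 1 GB (decimal, as in the answer)
        long blockSize = 64_000_000L;   // 64 MB block size
        System.out.println(blocksFor(fileSize, blockSize)); // 16
    }
}
```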
How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?
Same as for any other file; splittability makes no difference to replication. Each of the 16 blocks is independently replicated to 3 datanodes according to the block placement policy, so the cluster stores 48 block replicas for this file in total.
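With replication factor 3, each block exists on 3 datanodes, which also triples the raw storage the file consumes (a sketch with the question's numbers):

```java
public class ReplicaCount {
    public static void main(String[] args) {
        int blocks = 16;           // DFS blocks in the 1 GB file
        int replicationFactor = 3; // dfs.replication
        int fileSizeGb = 1;

        // Total block replicas tracked by the NameNode for this file.
        System.out.println(blocks * replicationFactor);     // 48
        // Approximate raw storage consumed across the cluster, in GB.
        System.out.println(fileSizeGb * replicationFactor); // 3
    }
}
```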
From the source code at this link: http://grepcode.com/file_/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java/?v=source
/** The class is responsible for choosing the desired number of targets
* for placing block replicas.
* The replica placement strategy is that if the writer is on a datanode,
* the 1st replica is placed on the local machine,
* otherwise a random datanode. The 2nd replica is placed on a datanode
* that is on a different rack. The 3rd replica is placed on a datanode
* which is on the same rack as the first replica.
*/
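The strategy described in that comment can be sketched in plain Java. This is a simplified illustration only: `Node` is a hypothetical stand-in for Hadoop's datanode descriptor, and the real chooser also considers load, free space, and fallbacks when a rack has no eligible node.

```java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    // Hypothetical stand-in for a datanode: just its host and rack.
    record Node(String host, String rack) {}

    // Follows the quoted comment: 1st replica on the writer's node,
    // 2nd on a node in a different rack, 3rd on a node in the same
    // rack as the 1st replica (but a different host).
    static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer); // 1st replica: local machine
        for (Node n : cluster) { // 2nd replica: different rack
            if (!n.rack().equals(writer.rack())) {
                targets.add(n);
                break;
            }
        }
        for (Node n : cluster) { // 3rd replica: same rack as the 1st
            if (n.rack().equals(writer.rack()) && !n.host().equals(writer.host())) {
                targets.add(n);
                break;
            }
        }
        return targets;
    }
}
```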
What is the DEFLATE algorithm?
DEFLATE is the lossless compression algorithm that the gzip format uses to compress data (a combination of LZ77 and Huffman coding). Reading a gzip-compressed file applies the inverse operation, commonly called "inflate", to decompress it.
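Java's standard library exposes this pair through `java.util.zip`: `GZIPOutputStream` writes a DEFLATE stream inside the gzip container, and `GZIPInputStream` inflates it back. A minimal round trip:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress with DEFLATE, wrapped in the gzip container format.
    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress ("inflate") a gzip stream back to the original bytes.
    static byte[] gunzip(byte[] data) throws Exception {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8)); // hello hdfs
    }
}
```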
Have a look at this slide to understand the algorithms used by the different variants of zip files.
Have a look at this presentation for more details.