How gzip file gets stored in HDFS


Question


HDFS supports storing files in compressed formats. I know that gzip compression doesn't support splitting. Imagine the file is a gzip-compressed file whose compressed size is 1 GB. Now my question is:

  1. How will this file be stored in HDFS (the block size is 64 MB)?

From this link I came to know that the gzip format uses DEFLATE to store the compressed data, and that DEFLATE stores data as a series of compressed blocks.
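To see concretely why a gzip stream is not splittable, here is a small sketch using Python's standard gzip/zlib modules (not Hadoop itself): decompression works only from the start of the stream, because a reader dropped into the middle of the DEFLATE block series cannot find a valid restart point.

```python
import gzip
import zlib

data = b"hello hdfs " * 100_000          # ~1 MB of repetitive input
compressed = gzip.compress(data)

# Decompressing from the start of the stream works fine.
assert gzip.decompress(compressed) == data

# A reader that starts in the middle cannot resynchronize: DEFLATE
# blocks are not self-describing restart points, and the gzip header
# only exists at the very beginning of the stream.
try:
    # wbits=31 tells zlib to expect gzip framing
    zlib.decompress(compressed[len(compressed) // 2:], wbits=31)
    print("unexpectedly decompressed")
except zlib.error:
    print("cannot decompress from the middle -> gzip is not splittable")
```

This is exactly the property that prevents Hadoop from handing different byte ranges of a gzip file to different readers.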

But I couldn't understand it completely and am looking for a broader explanation.

More doubts about the gzip-compressed file:

  1. How many blocks will there be for this 1 GB gzip-compressed file?
  2. Will it go onto multiple datanodes?
  3. How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?
  4. What is the DEFLATE algorithm?
  5. Which algorithm is applied while reading the gzip-compressed file?

I am looking for a broad and detailed explanation here.

Solution

How will this file be stored in HDFS (block size is 64 MB) if splitting is not supported for the gzip file format?

HDFS stores a gzip file the same way as any other file: it is cut into block-size chunks regardless of content, so the 1 GB file becomes 16 DFS blocks (1 GB / 64 MB = 15.625, rounded up), and those blocks are distributed across datanodes as usual. The fact that gzip is not splittable matters only when the file is processed, not when it is stored: a single reader (for example, one MapReduce mapper) must consume all 16 blocks in order, fetching the non-local ones over the network, because decompression can only start from the beginning of the stream.

How many blocks will there be for this 1 GB gzip-compressed file?

1 GB / 64 MB = 15.625, rounded up to 16 DFS blocks (the last block is only partially filled).
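The arithmetic above can be checked directly: ceiling division gives the block count, and the remainder gives the size of the final, partially filled block. A sketch (plain Python, using 1 GB = 1000 MB as in the figure above):

```python
import math

FILE_SIZE_MB = 1000   # 1 GB, using 1 GB = 1000 MB as in the figures above
BLOCK_SIZE_MB = 64    # HDFS block size

blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
last_block_mb = FILE_SIZE_MB - (blocks - 1) * BLOCK_SIZE_MB

print(FILE_SIZE_MB / BLOCK_SIZE_MB)  # 15.625
print(blocks)                        # 16
print(last_block_mb)                 # 40 -> the last block is only partially filled
```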

How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?

The same as for any other file: replication works at the block level, not the file level, so whether the file is splittable makes no difference here. Each of the 16 blocks is replicated independently, and with a replication factor of 3 the cluster stores 16 × 3 = 48 block replicas in total, placed on datanodes according to the replica placement policy.

From the source code at this link: http://grepcode.com/file_/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java/?v=source

and

http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/0.22.0/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyDefault.java/?v=source

/** The class is responsible for choosing the desired number of targets
 * for placing block replicas.
 * The replica placement strategy is that if the writer is on a datanode,
 * the 1st replica is placed on the local machine, 
 * otherwise a random datanode. The 2nd replica is placed on a datanode
 * that is on a different rack. The 3rd replica is placed on a datanode
 * which is on the same rack as the first replica.
 */
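The quoted strategy can be illustrated with a toy model (plain Python, with a hypothetical `topology` dict of rack → datanodes; the real `BlockPlacementPolicyDefault` is far more involved, handling load, free space, and node availability):

```python
import random

def choose_targets(writer, topology):
    """Pick 3 replica targets following the quoted strategy (simplified sketch)."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    # 1st replica: the writer's datanode if it is one, otherwise a random node.
    first = writer if writer in rack_of else random.choice(list(rack_of))
    # 2nd replica: a datanode on a different rack.
    second = random.choice([n for n in rack_of if rack_of[n] != rack_of[first]])
    # 3rd replica: another datanode on the same rack as the first replica.
    third = random.choice([n for n in rack_of
                           if rack_of[n] == rack_of[first] and n != first])
    return [first, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(choose_targets("dn1", topology))  # e.g. ['dn1', 'dn3', 'dn2']
```

With the writer on dn1, the first replica stays local, the second lands on rack2, and the third goes back to rack1, surviving the loss of any single rack.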

What is the DEFLATE algorithm?

DEFLATE is the lossless compression algorithm used by the gzip format. It combines LZ77 (replacing repeated byte sequences with back-references) with Huffman coding (giving shorter codes to more frequent symbols); the matching decompression procedure, applied when reading a gzip file, is commonly called INFLATE.
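A raw DEFLATE stream can be produced and consumed with Python's zlib module (a negative `wbits` means no zlib/gzip wrapper; shown only to illustrate the algorithm — Hadoop uses its own compression codecs):

```python
import zlib

data = b"abcabcabc" * 1000   # highly repetitive -> LZ77 back-references shine

# Negative wbits -> raw DEFLATE, with no zlib or gzip header/trailer.
comp = zlib.compressobj(wbits=-15)
compressed = comp.compress(data) + comp.flush()

# The inverse (INFLATE) recovers the original bytes exactly.
restored = zlib.decompressobj(wbits=-15).decompress(compressed)
assert restored == data
print(len(data), len(compressed))  # the compressed stream is far smaller
```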

Have a look at this slide to get an understanding of the other algorithms used by different variants of zip files.

Have a look at this presentation for more details.
