Storage format in HDFS


Problem description

How does HDFS store data?



I want to store huge files in a compressed fashion.



For example: I have a 1.5 GB file, with the default replication factor of 3.



It requires 1.5 × 3 = 4.5 GB of space.

I believe currently no implicit compression of data takes place.

Is there a technique to compress the file and store it in HDFS to save disk space?

Solution

HDFS stores any file in a number of 'blocks'. The block size is configurable on a per-file basis, but has a default value (such as 64/128/256 MB).
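As a small illustration of that per-file configurability (a sketch only; the path and sizes below are examples, not part of the original answer), the FileSystem API lets you pass an explicit block size and replication factor when creating a file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/example.bin");   // placeholder HDFS path

            System.out.println("default block size: " + fs.getDefaultBlockSize(path));

            // Create a file with a 256 MB block size and replication factor 3,
            // overriding the cluster defaults for just this file.
            try (FSDataOutputStream out = fs.create(
                    path, true, 4096, (short) 3, 256L * 1024 * 1024)) {
                out.writeUTF("hello");
            }
        }
    }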

So given a 1.5 GB file and a block size of 128 MB, Hadoop would break the file into roughly 12 blocks (12 × 128 MB ≈ 1.5 GB). Each block is also replicated a configurable number of times.

If your data compresses well (like text files), then you can compress the files and store the compressed files in HDFS - the same applies as above, so if the 1.5 GB file compresses to 500 MB, it would be stored as 4 blocks.

However, one thing to consider when using compression is whether the compression method supports splitting the file - that is, whether you can randomly seek to a position in the file and recover the compressed stream (GZip, for example, does not support splitting; BZip2 does).
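To make the splittability point concrete, here is a minimal sketch (not part of the original answer) that checks whether a codec class implements Hadoop's SplittableCompressionCodec interface; GzipCodec does not, while BZip2Codec does:

    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            CompressionCodec gzip = new GzipCodec();
            CompressionCodec bzip2 = new BZip2Codec();

            // GzipCodec does not implement SplittableCompressionCodec -> prints false
            System.out.println("gzip splittable?  " + (gzip instanceof SplittableCompressionCodec));
            // BZip2Codec does implement it -> prints true
            System.out.println("bzip2 splittable? " + (bzip2 instanceof SplittableCompressionCodec));
        }
    }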

Even if the method doesn't support splitting, Hadoop will still store the file in a number of blocks, but you'll lose some of the benefit of 'data locality', as the blocks will most probably be spread around your cluster.



In your map-reduce code, Hadoop has a number of compression codecs installed by default, and will automatically recognize certain file extensions (.gz for GZip files, for example), so you don't have to worry about whether the input/output needs to be compressed.
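Outside of MapReduce you can lean on the same extension-based detection yourself via CompressionCodecFactory; the sketch below (the HDFS path is a placeholder, not from the original answer) opens a .gz file from HDFS and reads it transparently:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadCompressedFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/file.txt.gz");   // placeholder HDFS path

            // The factory picks a codec from the file extension (.gz -> GzipCodec),
            // the same mechanism MapReduce input formats rely on.
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            CompressionCodec codec = factory.getCodec(path);

            InputStream raw = fs.open(path);
            InputStream in = (codec == null) ? raw : codec.createInputStream(raw);

            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                System.out.println(reader.readLine());   // first decompressed line
            }
        }
    }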



Hope this makes sense.



EDIT Some additional info in response to comments:

When writing to HDFS as output from a Map Reduce job, see the API for FileOutputFormat, in particular the following methods (a minimal driver sketch follows the list):

  • setCompressOutput(Job, boolean)
  • setOutputCompressorClass(Job, Class)
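The driver sketch below (class and path names are assumed, not from the original answer) shows those two FileOutputFormat calls in context:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "compressed-output-example");
            job.setJarByClass(CompressedOutputDriver.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Compress the job output and choose the codec (gzip here as an example).
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }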


When uploading files to HDFS, yes, they should be pre-compressed, and carry the file extension associated with that compression type (out of the box, Hadoop supports gzip with the .gz extension, so file.txt.gz would denote a gzipped file).
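If you prefer to do the compression at upload time rather than beforehand, one option (a sketch only; the local and HDFS paths are placeholders) is to stream the local file through a Hadoop codec straight into an HDFS file whose name carries the matching .gz extension:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class UploadGzipped {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Codec instances are created through ReflectionUtils so they pick up the config.
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            try (InputStream in = Files.newInputStream(Paths.get("/tmp/file.txt"));      // local source (placeholder)
                 OutputStream out = codec.createOutputStream(
                         fs.create(new Path("/data/file.txt.gz")))) {                    // HDFS target (placeholder)
                IOUtils.copyBytes(in, out, 4096);
            }
        }
    }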
