What is a sequence file in Hadoop?

Question
I am new to Map-reduce and I want to understand what is sequence file data input? I studied in the Hadoop book but it was hard for me to understand.
Solution

First we should understand what problems the SequenceFile tries to solve, and then how SequenceFile helps to solve them.
In HDFS
- SequenceFile is one of the solutions to the small-file problem in Hadoop.
- A small file is one significantly smaller than the HDFS block size (128MB by default).
- Each file, directory, and block in HDFS is represented as an object in the NameNode's memory and occupies about 150 bytes.
- 10 million files would therefore use about 3 gigabytes of NameNode memory.
- A billion files is not feasible.
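The 3 GB estimate above can be reproduced with simple arithmetic. A minimal sketch, assuming roughly two namespace objects per small file (the file inode plus one block entry) at ~150 bytes each — the exact per-object cost varies by Hadoop version:

```python
# Rough NameNode memory estimate for the small-files problem.
# Assumption: ~150 bytes per namespace object, one block per small file,
# so each file costs about two objects (file inode + block entry).
BYTES_PER_OBJECT = 150
OBJECTS_PER_SMALL_FILE = 2

def namenode_memory_bytes(num_files: int) -> int:
    """Approximate NameNode heap consumed by num_files small files."""
    return num_files * OBJECTS_PER_SMALL_FILE * BYTES_PER_OBJECT

print(namenode_memory_bytes(10_000_000) / 1e9)  # 3.0 (GB)
```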
In MapReduce
- Map tasks usually process a block of input at a time (using the default FileInputFormat).
- The more files there are, the more map tasks are needed, and the slower the job can become.
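A quick sketch of why the file count drives the number of map tasks — assuming the default FileInputFormat behaviour of at least one split per file, with splits capped at the block size:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128MB HDFS block size

def num_map_tasks(file_sizes: list[int]) -> int:
    """Approximate map-task count: the default FileInputFormat never merges
    files, so each file yields at least one split (more if it spans blocks)."""
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)

# 10,000 small 100KB files -> 10,000 map tasks
print(num_map_tasks([100 * 1024] * 10_000))  # 10000
# One 1GB file -> only 8 map tasks
print(num_map_tasks([1024 * 1024 * 1024]))   # 8
```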
Small file scenarios
- The files are pieces of a larger logical file.
- The files are inherently small, for example, images.
These two cases require different solutions.
- For the first one, write a program to concatenate the small files together (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).
- For the second one, some kind of container is needed to group the files in some way.
Solutions in Hadoop
HAR files
- HAR (Hadoop Archives) files were introduced to alleviate the problem of lots of files putting pressure on the NameNode's memory.
- HARs are probably best used purely for archival purposes.
SequenceFile
- The concept of SequenceFile is to put each small file into a single larger file.
For example, suppose there are 10,000 100KB files; we can write a program to put them into a single SequenceFile as shown below, using the filename as the key and the file content as the value.
(Figure: SequenceFile file layout — http://img.blog.csdn.net/20151213123516719)
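As a conceptual sketch of that packing step — this is not the real binary SequenceFile format, which is written with Hadoop's Java `SequenceFile.Writer`; it only illustrates the filename→content key/value idea:

```python
import io
import struct

# Conceptual sketch of the SequenceFile idea: pack many small files into one
# container as (key, value) records, key = filename, value = file content.
# NOT the real Hadoop binary format; length-prefixed records for illustration.

def pack(files: dict[bytes, bytes]) -> bytes:
    """Serialize {filename: content} into one length-prefixed record stream."""
    out = io.BytesIO()
    for name, content in files.items():
        out.write(struct.pack(">II", len(name), len(content)))
        out.write(name)
        out.write(content)
    return out.getvalue()

def unpack(blob: bytes) -> dict[bytes, bytes]:
    """Inverse of pack(): recover the {filename: content} mapping."""
    files, pos = {}, 0
    while pos < len(blob):
        name_len, content_len = struct.unpack_from(">II", blob, pos)
        pos += 8
        name = blob[pos:pos + name_len]
        pos += name_len
        files[name] = blob[pos:pos + content_len]
        pos += content_len
    return files

blob = pack({b"img001.jpg": b"\xff\xd8jpegdata", b"img002.jpg": b"\xff\xd8more"})
print(unpack(blob)[b"img001.jpg"])  # b'\xff\xd8jpegdata'
```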
Some benefits:
- Less memory is needed on the NameNode. Continuing with the 10,000 100KB files example:
- Before using SequenceFile, 10,000 objects occupy about 4.5MB of RAM in NameNode.
- After using SequenceFile (one 1GB SequenceFile with 8 HDFS blocks), these objects occupy about 3.6KB of RAM in the NameNode.
- SequenceFile is splittable, so it is suitable for MapReduce.
- SequenceFile supports compression.
Supported compression types — the file structure depends on the compression type:
- Uncompressed
- Record-Compressed: compresses each record as it's added to the file.
(Figure: record-compressed SequenceFile layout — http://img.blog.csdn.net/20151213182753789)
- Block-Compressed:
(Figure: block-compressed SequenceFile layout — http://img.blog.csdn.net/20151213183017236)
- Waits until data reaches block size to compress.
- Block compression provides a better compression ratio than record compression.
- Block compression is generally the preferred option when using SequenceFile.
- Block here is unrelated to HDFS or filesystem block.
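The ratio advantage of block compression can be illustrated outside Hadoop with a small sketch, using zlib as a stand-in codec (real SequenceFiles use whatever codec Hadoop is configured with, e.g. gzip or Snappy): compressing many records together lets the codec exploit redundancy across records.

```python
import zlib

# Illustration of record vs. block compression, zlib as a stand-in codec.
# 1,000 similar small records, e.g. log lines sharing a common structure.
records = [f"2015-12-13 host-42 GET /static/img/{i}.jpg 200".encode()
           for i in range(1000)]

# Record compression: each record compressed independently.
record_compressed = sum(len(zlib.compress(r)) for r in records)

# Block compression: the same records compressed together as one block.
block_compressed = len(zlib.compress(b"".join(records)))

print(record_compressed, block_compressed)
assert block_compressed < record_compressed  # block wins on similar records
```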