Concept of Hadoop blocks in HDFS


Problem description


I have some questions regarding blocks in Hadoop. I read that Hadoop uses HDFS, which creates blocks of a specific size.

First question: Do the blocks physically exist on the hard disk on a normal file system like NTFS, i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using Hadoop commands?

Second question: Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?

Third question: Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?

Fourth question: Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?

Solution

1. Do the blocks physically exist on the hard disk on a normal file system like NTFS, i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using Hadoop commands?

Yes. Blocks exist physically. You can use a command like hadoop fsck /path/to/file -files -blocks to see them.

Refer to the SE question below for commands to view blocks:

Viewing the number of blocks for a file in hadoop
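
Besides fsck, block locations can also be inspected programmatically through Hadoop's FileSystem API. Below is a minimal sketch assuming the cluster configuration is on the classpath; the path /path/to/file and the class name are placeholders, not anything from the original question.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws IOException {
        // Uses the cluster configuration found on the classpath (core-site.xml, hdfs-site.xml)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; replace with the file you want to inspect
        Path file = new Path("/path/to/file");
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per HDFS block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}
```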

2. Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?

Hadoop = distributed storage (HDFS) + distributed processing (MapReduce & YARN).

A MapReduce job works on input splits => the input splits are created from data blocks in the Datanodes. Data blocks are created during the write operation of a file. If you are running a job on existing files, the data blocks are pre-created before the job, and the InputSplits are created during the Map operation. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change the input data blocks. The Reducer generates output data as new data blocks.

Mappers process the input splits and emit their output to the Reducer job, as in the sketch below.
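
As an illustration of that flow, here is a minimal, generic Mapper written against the standard Hadoop MapReduce API; the class name and the emitted key are placeholders chosen for the example, not part of the original answer.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each Mapper instance is fed one InputSplit; map() is called once per record in that split.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text key = new Text("line.length");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit a key/value pair; the framework shuffles these to the Reducers.
        // This intermediate output is not written to HDFS.
        context.write(key, new IntWritable(line.getLength()));
    }
}
```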

3. Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?

The input is already available as physical DFS blocks. A MapReduce job works on InputSplits. Blocks and InputSplits may or may not be the same. A block is a physical entity and an InputSplit is a logical entity. Refer to the SE question below for more details:

How does Hadoop perform input splits?
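
To make the block/split relationship concrete, the sketch below mirrors the split-size rule applied by FileInputFormat. It is a simplified re-implementation for illustration, not a call into Hadoop itself: a split normally covers exactly one block unless the configured minimum/maximum split sizes say otherwise.

```java
public class SplitSizeSketch {

    // Simplified version of FileInputFormat's split-size logic:
    // the split size is the block size, clamped between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
        long defaultMin = 1L;                  // default minimum split size
        long defaultMax = Long.MAX_VALUE;      // default maximum split size

        // With the defaults, split size == block size, so one split per block.
        System.out.println(computeSplitSize(blockSize, defaultMin, defaultMax));

        // Raising the minimum split size makes one split span more than one block.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, defaultMax));
    }
}
```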

4. Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?

Mapper input: the input blocks pre-exist. The map process starts on input blocks/splits that were stored in HDFS before the Mapper job began.

Mapper output: not stored in HDFS; it does not make sense to store intermediate results on HDFS with a replication factor X greater than 1.

Reducer output: the reducer output is stored in HDFS. The number of blocks will depend on the size of the reducer output data.
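
As a rough illustration of that last point (a sketch with made-up numbers, not output from a real job): the block count of a reducer's output file is just its size divided by the HDFS block size, rounded up.

```java
public class OutputBlockCount {

    // Ceiling division: how many HDFS blocks a file of the given size occupies.
    static long expectedBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;      // 128 MB default block size
        long reducerOutput = 300L * 1024 * 1024;  // hypothetical 300 MB reducer output

        // 300 MB with 128 MB blocks -> 3 blocks (two full blocks and one partial block)
        System.out.println(expectedBlocks(reducerOutput, blockSize));
    }
}
```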
