Concept of blocks in Hadoop HDFS


Question

I have some questions regarding blocks in Hadoop. I have read that Hadoop uses HDFS, which splits files into blocks of a specific size.

First question: Do the blocks physically exist on the hard disk in the hosting file system (e.g. NTFS), i.e. can we see the blocks in the hosting file system, or can they only be seen using Hadoop commands?

Second question: Does Hadoop create the blocks before running a task, i.e. do the blocks exist from the moment a file is written, or does Hadoop create the blocks only when running a task?

Third question: Are the blocks determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?

Fourth question: Are the blocks the same before and after running a task, or does it depend on the configuration? Are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes that execute the task?

Answer

1. Do the blocks physically exist on the hard disk in the hosting file system (e.g. NTFS), or can they only be seen using Hadoop commands?

Yes, the blocks exist physically: each DataNode stores them as ordinary files in its local file system. You can inspect a file's blocks with commands like hadoop fsck /path/to/file -files -blocks

Refer to the below SE question for commands to view blocks in Hadoop.
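As a rough illustration of why the blocks exist from write time (this is not the actual HDFS code, just a sketch of the size arithmetic), a file's byte stream is chopped into fixed-size blocks, and only the last block may be smaller:

```python
def block_ranges(file_size: int, block_size: int = 128 * 1024 * 1024):
    """Return (offset, length) pairs for the HDFS-style blocks of a file.

    Simplified sketch: real HDFS also assigns block IDs and replicates
    each block to several DataNodes; here we only show how a file of a
    given size maps onto fixed-size blocks.
    """
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        ranges.append((offset, length))
        offset += length
    return ranges

mb = 1024 * 1024
# A 300 MB file with the default 128 MB block size yields three blocks:
# two full 128 MB blocks and one 44 MB tail block.
print(block_ranges(300 * mb, 128 * mb))
```

Note that the tail block only occupies as much local disk space as it actually contains; a 44 MB tail block does not waste a full 128 MB.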

2. Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?

Hadoop = distributed storage (HDFS) + distributed processing (MapReduce & YARN).

A MapReduce job works on input splits, and the input splits are created from the data blocks stored on the DataNodes. Data blocks are created during the write operation of a file. If you run a job on existing files, the data blocks already exist before the job, and the InputSplits are created when the Map phase starts. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change the input data blocks; the Reducer writes its output as new data blocks.

The Mapper processes the input splits and emits its output to the Reducer.

3. Are the blocks determined and created before splitting (i.e. the getSplits method of the InputFormat class) regardless of the number of splits, or afterwards, depending on the splits?

The input is already available as physical HDFS blocks; a MapReduce job works on InputSplits. Blocks and InputSplits may or may not be the same: a block is a physical entity, while an InputSplit is a logical entity. Refer to the below SE question for more details:

How does Hadoop perform input splits?
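To make the block/split relationship concrete, here is a simplified sketch of the split-size rule used by Hadoop's FileInputFormat (the real logic lives in FileInputFormat.getSplits in Java; the Python function name here is illustrative):

```python
def compute_split_size(block_size: int, min_size: int = 1,
                       max_size: int = 2**63 - 1) -> int:
    # Mirrors Hadoop's FileInputFormat.computeSplitSize:
    #   splitSize = max(minSize, min(maxSize, blockSize))
    # With default min/max settings, the split size equals the block
    # size, which is why splits usually line up with blocks one-to-one.
    return max(min_size, min(max_size, block_size))

mb = 1024 * 1024
print(compute_split_size(128 * mb))                  # default: 128 MB
print(compute_split_size(128 * mb, max_size=64 * mb))  # smaller splits
print(compute_split_size(32 * mb, min_size=64 * mb))   # split spans blocks
```

Lowering the maximum split size produces more (smaller) splits per block, while raising the minimum split size makes a single split span multiple blocks; in both cases the underlying blocks are untouched, only the logical view changes.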

4. Are the blocks the same before and after running the task, or does it depend on the configuration? Are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to the data nodes that execute the task?

Mapper input: the input blocks pre-exist. The Map phase starts from input blocks/splits that were stored in HDFS before the Mapper job began.

Mapper output: not stored in HDFS. It would make no sense to store intermediate results in HDFS with a replication factor greater than 1.

Reducer output: the Reducer output is stored in HDFS. The number of blocks depends on the size of the Reducer's output data.
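The block count of a Reducer's output file follows from simple arithmetic (a sketch, assuming one output file per Reducer and the default 128 MB block size):

```python
import math

def num_output_blocks(output_bytes: int,
                      block_size: int = 128 * 1024 * 1024) -> int:
    # Each reducer writes its output file to HDFS, and HDFS stores that
    # file as ceil(size / block_size) blocks.
    return math.ceil(output_bytes / block_size)

mb = 1024 * 1024
print(num_output_blocks(200 * mb))  # 200 MB -> 2 blocks (128 MB + 72 MB)
```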

