Split size vs Block size in Hadoop


Question



What is the relationship between split size and block size in Hadoop? As I have read, the split size must be n times the block size (n is an integer and n > 0). Is this correct? Is there any required relationship between split size and block size?

Solution

In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS: then there will be 1 GB / 64 MB = 16 blocks, and these blocks will be distributed across the DataNodes. Which DataNodes the blocks reside on depends on your cluster configuration.
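As a quick check of that arithmetic, here is a minimal, purely illustrative plain-Java sketch (the sizes are just the example values from this answer):

    public class BlockCount {
        public static void main(String[] args) {
            long fileSize = 1024L * 1024 * 1024; // 1 GB example file
            long blockSize = 64L * 1024 * 1024;  // 64 MB default HDFS block size

            // Ceiling division: a partially filled final chunk still takes its own block.
            long numBlocks = (fileSize + blockSize - 1) / blockSize;
            System.out.println(numBlocks); // prints 16
        }
    }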

Data splitting happens based on file offsets. The goal of splitting a file and storing it in different blocks is parallel processing and failover of the data.
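To make "based on file offsets" concrete, here is a hedged sketch using Hadoop's FileSplit class, which describes a split as a (file, start offset, length, hosts) tuple; the path and host names below are made up for illustration:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class OffsetDemo {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            // Hypothetical 100 MB file, cut at the 64 MB block boundary.
            Path file = new Path("/data/example.txt");

            FileSplit first  = new FileSplit(file, 0,       64 * mb, new String[]{"node1"});
            FileSplit second = new FileSplit(file, 64 * mb, 36 * mb, new String[]{"node2"});

            // Each split is just an offset + length into the same file.
            System.out.println(first.getStart() + ", " + first.getLength());   // 0, 67108864
            System.out.println(second.getStart() + ", " + second.getLength()); // 67108864, 37748736
        }
    }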

Difference between block size and split size

A split is a logical division of the data, used during data processing with a Map/Reduce program or other data-processing techniques in the Hadoop ecosystem. The split size is a user-defined value, and you can choose your own split size based on the volume of data you are processing.

Splits are basically used to control the number of Mappers in a Map/Reduce program. If you have not defined an input split size in your Map/Reduce program, the default HDFS block size is used as the input split size.
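For example, with the newer org.apache.hadoop.mapreduce API the split size can be bounded from the job driver. A minimal sketch, assuming the input path arrives as a command-line argument and the job name is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo"); // placeholder name

            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Fewer mappers: raise the minimum split size above the block size.
            FileInputFormat.setMinInputSplitSize(job, 100L * 1024 * 1024); // 100 MB

            // More mappers: cap the maximum split size below the block size.
            // FileInputFormat.setMaxInputSplitSize(job, 25L * 1024 * 1024); // 25 MB

            // ... set Mapper/Reducer classes and the output path, then submit the job.
        }
    }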

Example:

Suppose you have a 100 MB file and the HDFS default block size is 64 MB; then the file will be chopped up into, and occupy, 2 blocks. Now you have a Map/Reduce program to process this data, but you have not specified an input split size; then, based on the number of blocks (2), 2 input splits will be used for the Map/Reduce processing, and 2 Mappers will be assigned to this job.

But suppose you have specified a split size (say 100 MB) in your Map/Reduce program; then both blocks (2 blocks) will be treated as a single split for the Map/Reduce processing, and 1 Mapper will be assigned to this job.

Suppose instead you have specified a split size of, say, 25 MB in your Map/Reduce program; then there will be 4 input splits for the Map/Reduce program, and 4 Mappers will be assigned to the job.
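All three outcomes follow from the rule FileInputFormat uses to pick a split size in the mapreduce API, splitSize = max(minSize, min(maxSize, blockSize)). A small sketch reproducing the arithmetic (the default min/max values are approximated here as 1 and Long.MAX_VALUE):

    public class SplitMath {
        // The rule applied by FileInputFormat's computeSplitSize.
        static long splitSize(long minSize, long maxSize, long blockSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        static long splits(long fileSize, long splitSize) {
            return (fileSize + splitSize - 1) / splitSize; // ceiling division
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long fileSize = 100 * mb;
            long blockSize = 64 * mb;

            // No split size specified: split size == block size, so 2 splits.
            System.out.println(splits(fileSize, splitSize(1, Long.MAX_VALUE, blockSize)));        // 2
            // Minimum split size raised to 100 MB: both blocks fall into one split.
            System.out.println(splits(fileSize, splitSize(100 * mb, Long.MAX_VALUE, blockSize))); // 1
            // Maximum split size capped at 25 MB: 4 splits.
            System.out.println(splits(fileSize, splitSize(1, 25 * mb, blockSize)));               // 4
        }
    }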

Conclusion:

  1. A split is a logical division of the input data, while a block is a physical division of the data.
  2. The HDFS default block size is the default split size if no input split size is specified.
  3. The split is user defined, and the user can control the split size in his Map/Reduce program.
  4. One split can map to multiple blocks, and there can be multiple splits of one block.
  5. The number of map tasks (Mappers) is equal to the number of splits.
