About MR InputSplit

Problem Description

As I understand it, splitting a file into blocks when copying it into HDFS and computing input splits over the file for mapper input are two entirely different mechanisms.

Here is my question:

Suppose my File1 is 128 MB in size and is split into two blocks stored on two different data nodes (Node1, Node2) in the Hadoop cluster. I want to run an MR job on this file, and I get two input splits of 70 MB and 58 MB respectively. The first mapper will run on Node1 with the 70 MB input split, but Node1 only holds 64 MB of that data; the remaining 6 MB is on Node2.

To complete the Map task on Node1, does Hadoop transfer the 6 MB of data from Node2 to Node1? If so, what happens if Node1 does not have enough storage to hold the 6 MB of data from Node2?

My apologies if my question is awkward.

Recommended Answer

64 MB of the data will be written to Node 1 and 6 MB of the data will be written to Node 2.

The MapReduce algorithm does not work on the physical blocks of the file. It works on logical input splits. Where a split ends depends on where a record was written: a record may span two blocks.
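
To make the relationship between blocks and logical splits concrete, here is a minimal Java sketch of the split-size rule used by Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)); the driver class and the concrete numbers are illustrative, chosen to match the question's 64 MB block size:

```java
// Minimal sketch of the split-size rule used by FileInputFormat:
// splitSize = max(minSize, min(maxSize, blockSize)).
// The driver class and the concrete numbers below are illustrative only.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block, as in the question
        long minSize   = 1L;                  // default mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;      // default mapreduce.input.fileinputformat.split.maxsize

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println("Logical split size: " + splitSize + " bytes");
        // With the defaults the split size equals the block size, but the last
        // record of a split may still spill into the next block on another node.
    }
}
```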

In your example, assume that a record starts after 63 MB of data and that the record is 2 MB long. In that case, 1 MB of the record is part of Node 1 and the other 1 MB is part of Node 2. The other 1 MB of data will be transferred from Node 2 to Node 1 during the Map operation.
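
To illustrate why those extra bytes travel between nodes, here is a hedged sketch of the rule that Hadoop's line-oriented record readers follow. It is not the real LineRecordReader, just the same idea in simplified form; it assumes single-byte characters and '\n'-terminated records:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Simplified sketch of the rule Hadoop's line-oriented readers follow:
// a split that does not start at offset 0 skips its first (partial) line,
// and every split keeps reading past its nominal end to finish its last record.
// This is an illustration of the idea, not the real LineRecordReader code.
public class BoundaryAwareLineReader {

    private final BufferedReader in;
    private long pos;          // current byte offset in the file
    private final long end;    // nominal end of this split

    public BoundaryAwareLineReader(Reader source, long start, long end) throws IOException {
        this.in = new BufferedReader(source);
        this.pos = start;
        this.end = end;
        if (start != 0) {
            // Not the first split: the partial line at the front belongs to the
            // previous split's reader, so discard it here.
            pos += skipLine();
        }
    }

    /** Returns the next full record, or null once the split (plus its boundary-crossing record) is done. */
    public String nextRecord() throws IOException {
        if (pos > end) {
            return null;              // the record that crossed the boundary was already returned
        }
        String line = in.readLine();
        if (line == null) {
            return null;              // end of file
        }
        pos += line.length() + 1;     // may push pos past 'end'; those bytes come from the remote node
        return line;
    }

    private long skipLine() throws IOException {
        String skipped = in.readLine();
        return skipped == null ? 0 : skipped.length() + 1;
    }
}
```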

Have a look at the picture below for a better understanding of logical splits vs. physical blocks.

Have a look at these SE questions:

How does Hadoop process records split across block boundaries?

About Hadoop/HDFS file splitting

MapReduce data processing is driven by this concept of input splits. The number of input splits calculated for a specific application determines the number of mapper tasks.
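
As a hedged illustration of that relationship, the sketch below asks TextInputFormat for a job's splits and prints their count, which is the number of map tasks that would be launched. The input path /data/File1 is hypothetical, a Hadoop configuration is assumed on the classpath, and the commented-out properties are the standard knobs for influencing split size:

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Lists the logical input splits computed for a job; one map task is launched per split.
// The input path is hypothetical, and a Hadoop configuration is assumed on the classpath.
public class ListSplits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Optional knobs that influence split size (and therefore the mapper count):
        // conf.setLong("mapreduce.input.fileinputformat.split.minsize", 70L * 1024 * 1024);
        // conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 70L * 1024 * 1024);

        Job job = Job.getInstance(conf, "list-splits");
        FileInputFormat.addInputPath(job, new Path("/data/File1"));  // hypothetical 128 MB file

        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Number of splits (= number of map tasks): " + splits.size());
        for (InputSplit split : splits) {
            System.out.println(split);  // a FileSplit prints its path, start offset and length
        }
    }
}
```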

Each of these mapper tasks is assigned, where possible, to a slave node where its input split is stored. The Resource Manager (or JobTracker, if you're in Hadoop 1) does its best to ensure that input splits are processed locally.

If data locality can't be achieved because an input split crosses data node boundaries, some data will be transferred from one data node to another.
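
To see the physical side that this locality decision is based on, here is a small sketch using the HDFS client call FileSystem.getFileBlockLocations to print which hosts hold which byte ranges of a file; the path is again hypothetical and an HDFS configuration is assumed on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints which hosts hold which byte ranges of a file. Comparing this physical
// layout with the logical splits makes it visible when a split crosses a block
// (and therefore a node) boundary, which is when data has to move between nodes.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());       // assumes HDFS config on the classpath
        Path file = new Path("/data/File1");                       // hypothetical 128 MB file
        FileStatus status = fs.getFileStatus(file);

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```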
