How does Hadoop perform input splits?
Question
This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines, and for simplicity let's assume each line is of the form <k, v>, where k is the offset of the line from the beginning of the file and v is the content of the line.
Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on one of them? Or do we have to write a partitioning function that produces the N splits and then run each map task on a generated split?
All I want to know is whether the splits are done internally, or whether we have to split the data manually.
More specifically, each time the map() function is called, what are its Key key and Value val parameters?
Thanks, Deepak
Answer
The InputFormat is responsible for providing the splits.
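As a rough illustration of what FileInputFormat does internally, the sketch below computes split boundaries for a file from its total size and a split size. The class and method names here are made up for this example, and the real logic is more involved (it also honours configured minimum/maximum split sizes and a slop factor, and avoids splitting mid-block):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Simplified version of FileInputFormat's split computation:
    // each split covers the byte range [offset, offset + length) of the file.
    static List<long[]> computeSplits(long fileSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileSize) {
            long length = Math.min(splitSize, fileSize - offset);
            splits.add(new long[] { offset, length });
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB split size yields three splits:
        // 128 MB, 128 MB, and a final 44 MB remainder.
        long mb = 1024L * 1024L;
        for (long[] s : computeSplits(300 * mb, 128 * mb)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

By default the split size equals the HDFS block size, which is why the number of map tasks typically equals the number of blocks rather than anything you choose directly.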
In general, if you have n nodes, HDFS will distribute the file over all these n nodes. If you start a job, there will by default be one mapper per input split (typically one per HDFS block). Thanks to Hadoop, the mapper on a machine will process the part of the data that is stored on that node. I think this is called rack awareness.
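As for the map() parameters: with the default TextInputFormat, the key is the byte offset of the line within the file (a LongWritable) and the value is the line's content (a Text), which matches the <k, v> form in the question. The plain-Java sketch below mimics how a file's bytes become such <offset, line> records; the class and method names are invented for illustration (in Hadoop this is done by LineRecordReader), and it assumes single-byte characters:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordSketch {
    // Mimics what LineRecordReader hands to map():
    // key = byte offset of the line, value = line content.
    // Assumes single-byte characters and '\n' line endings.
    static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            records.put(offset, line);
            offset += line.length() + 1; // +1 for the newline byte
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> records = toRecords("first line\nsecond line\nthird");
        // Prints: 0 -> first line, 11 -> second line, 23 -> third
        records.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

So in map(LongWritable key, Text value, ...), your map task never sees the whole file, only the records reconstructed from its own split.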
So, to make a long story short: upload the data to HDFS and start an MR job. Hadoop will take care of optimised execution.