How does Hadoop perform input splits?


Problem description

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines, and for the sake of simplicity, let's consider that each line is of the form <k, v>, where k is the offset of the line from the beginning of the file and the value is the content of the line.

Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on one of those splits? Or do we have to write a partitioning function that produces the N splits and then run each map task on the split it generates?

All I want to know is whether the splits are done internally, or whether we have to split the data manually.

More specifically, each time the map() function is called, what are its Key key and Value val parameters?

Thanks, Deepak

Recommended answer

The InputFormat is responsible for providing the splits.
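To make the key/value part of the question concrete, here is a minimal sketch of a mapper under the default TextInputFormat (the usual setup for line-oriented files): the framework computes the splits itself and hands each map() call the line's byte offset as the key and the line's content as the value. The class name LineOffsetMapper is illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default TextInputFormat, the framework (not user code) computes
// the input splits and feeds each mapper one record at a time:
//   key   = byte offset of the line from the start of the file (LongWritable)
//   value = the content of the line (Text)
public class LineOffsetMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Simply re-emit the (offset, line) pair; a real job would do work here.
        context.write(key, value);
    }
}
```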

In general, if you have n nodes, HDFS will distribute the file over all these n nodes. If you start a job, there will by default be one mapper per input split (with the default FileInputFormat, roughly one per HDFS block). Hadoop tries to run each mapper on a node that stores the part of the data it will process; I think this is called rack awareness (the scheduling behaviour itself is usually called data locality).
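For reference, the number of mappers falls out of the split size, which the default FileInputFormat derives from the HDFS block size together with configurable minimum/maximum bounds. Below is a minimal sketch of that computation, assuming the well-known formula used by FileInputFormat.computeSplitSize; the surrounding class and the example numbers are illustrative.

```java
// Sketch of how FileInputFormat derives the split size (and hence the number
// of map tasks): splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;        // typical HDFS block: 128 MB
        long fileSize  = 10L * 1024 * 1024 * 1024;  // e.g. a 10 GB input file
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // With the defaults, splits align with blocks, so ~80 map tasks here.
        System.out.println("splits ~ " + (long) Math.ceil((double) fileSize / splitSize));
    }
}
```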

So, to make a long story short: upload the data to HDFS and start an MR job. Hadoop will take care of the optimized execution.
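As a usage sketch under those assumptions, a driver that leaves the splitting entirely to the framework could look like the following; note that no partitioning of the input is written by hand, and the paths and class names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineOffsetJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "line offsets");
        job.setJarByClass(LineOffsetJob.class);
        job.setMapperClass(LineOffsetMapper.class);
        job.setNumReduceTasks(0);               // map-only job for this example
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // We only name the input; the InputFormat computes the splits itself.
        FileInputFormat.addInputPath(job, new Path("/user/deepak/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/deepak/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```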

