How does Hadoop perform input splits?


Problem Description



This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines. For the sake of simplicity, consider that each line is of the form <k,v>, where k is the offset of the line from the beginning of the file and v is the content of the line.

Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on one of those splits? Or do we have to write a partitioning function that produces the N splits, and then run each map task on the split it generates?

All I want to know is whether the splits are done internally, or whether we have to split the data manually.

More specifically, each time the map() function is called, what are its Key key and Value val parameters?

Thanks, Deepak

Solution

The InputFormat is responsible for providing the splits, so the splitting is done internally by the framework; you do not have to partition the data yourself.
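As for what key and value arrive in map(): with the default TextInputFormat, the key is the byte offset of the line from the start of the file (a LongWritable) and the value is the line itself (a Text). A minimal mapper sketch (the class name here is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat (the default), map() is called once per line:
    //   key   = byte offset of the line from the start of the file
    //   value = the contents of the line
    public class LineOffsetMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Re-emit the (offset, line) pair to show what map() receives.
            context.write(key, value);
        }
    }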

In general, if you have n nodes, HDFS will distribute the file's blocks across those n nodes. When you start a job, there will by default be one mapper per input split, and splits normally line up with HDFS blocks. Hadoop tries to schedule each mapper on a machine that already stores the block it will process; this is called data locality (rack awareness is the related mechanism that tells the scheduler which nodes sit on which rack, so it can fall back to a rack-local node).
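So the number of map tasks is not simply n: FileInputFormat derives a split size from the HDFS block size and the configured minimum/maximum split sizes. A simplified sketch of that logic (not the exact library source):

    // Simplified sketch of how FileInputFormat sizes splits.
    // blockSize comes from HDFS (e.g. 128 MB); minSize/maxSize come from
    // mapreduce.input.fileinputformat.split.{minsize,maxsize}.
    long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Example: a 10 GB file with 128 MB blocks and default min/max yields
    // about 10 * 1024 / 128 = 80 splits, hence about 80 map tasks.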

So, to make a long story short: upload the data into HDFS and start an MR job. Hadoop will take care of optimizing the execution.
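As a concrete illustration of "upload the data and start an MR job", a minimal driver could look like this (the class names and input/output paths are placeholders; the mapper is the one sketched above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitDemoDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split demo");
            job.setJarByClass(SplitDemoDriver.class);
            job.setMapperClass(LineOffsetMapper.class); // mapper sketched above
            job.setNumReduceTasks(0);                   // map-only job
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            // Splitting is handled by the InputFormat; we only name the paths.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }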

