How the JobClient in Hadoop computes input splits


Problem description



I am trying to get insight into the MapReduce architecture. I am consulting this article: http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/. I have some questions regarding the JobClient component of the MapReduce framework. My questions are:

How does the JobClient compute the input splits on the data?

According to the material I am consulting, the Job Client computes input splits on the data located in the input path on HDFS that is specified when running the job. The article then says that the Job Client copies the resources (jars and computed input splits) to HDFS. Here is my question: if the input data is already in HDFS, why does the JobClient copy the computed input splits into HDFS?

Let's assume that the Job Client copies the input splits to HDFS. Now, when the job is submitted to the JobTracker and the JobTracker initializes the job, why does it retrieve the input splits from HDFS?

Apologies if my question is not clear. I am a beginner. :)

Solution

No, the JobClient does not copy the input splits themselves to HDFS. You have quoted the answer yourself:

Job Client computes input splits on the data located in the input path on the HDFS specified while running the job. The article says then Job Client copies the resources (jars and computed input splits) to the HDFS.

The input data itself stays on the cluster. The client only computes on the meta information it gets from the namenode (block size, data length, block locations). These computed input splits carry that meta information to the tasks, e.g. the block offset and the length to compute on.
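
To make this concrete, here is a minimal sketch of that client-side computation. It is not the actual Hadoop source: the Split class and computeSplits method are hypothetical stand-ins for FileSplit and FileInputFormat.getSplits(), and the host names are placeholders, but it shows that only namenode metadata (file length, block size, block locations) is needed, never the data blocks themselves.

```java
import java.util.ArrayList;
import java.util.List;

class SplitComputationSketch {

    // Hypothetical stand-in for org.apache.hadoop.mapreduce.lib.input.FileSplit.
    static class Split {
        final String path;      // which file
        final long offset;      // where the task starts reading
        final long length;      // how many bytes the task processes
        final String[] hosts;   // preferred nodes, taken from block locations

        Split(String path, long offset, long length, String[] hosts) {
            this.path = path;
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // fileLength and blockSize come from namenode metadata; in real Hadoop the
    // block locations would be looked up per block instead of hard-coded.
    static List<Split> computeSplits(String path, long fileLength, long blockSize) {
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            splits.add(new Split(path, offset, length, new String[] {"nodeA", "nodeB"}));
        }
        return splits;
    }
}
```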

Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit: it contains the file path, the start offset, and the length of the chunk that a single task will operate on as its input. The serializable class you may also want to look at is org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.
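
As an illustration of what a task sees, the following mapper (a sketch using the standard new-API classes, with the hypothetical name SplitLoggingMapper) just logs the FileSplit meta information it was handed in setup():

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Identity mapper that only reports which split this task was assigned.
public class SplitLoggingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Each task receives exactly one split's meta information.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("path   = " + split.getPath());
        System.out.println("start  = " + split.getStart());
        System.out.println("length = " + split.getLength());
    }
}
```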

This meta information will be computed for each task that will be run, and copied with the jars to the node that will actually execute this task.
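
For a usage sketch that ties the pieces together (assuming the SplitLoggingMapper from the previous example), a plain map-only driver shows when this happens: the splits are computed and staged by the client at submission time, before any task runs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitDemoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-demo");
        job.setJarByClass(SplitDemoDriver.class);
        job.setMapperClass(SplitLoggingMapper.class); // mapper from the previous sketch
        job.setInputFormatClass(TextInputFormat.class);
        job.setNumReduceTasks(0);                     // map-only job, identity output
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // On submission the client computes the input splits and copies the job
        // jar together with the split meta information to the job's staging
        // directory on HDFS, which the JobTracker then reads to schedule tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```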
