How does the job client in Hadoop compute InputSplits?


Question

I am trying to gain insight into the MapReduce architecture. I am consulting this article: http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/. I have some questions regarding the JobClient component of the MapReduce framework. My questions are:

How does the JobClient compute the input splits on the data?

According to the material I am consulting, the JobClient computes input splits on the data located in the input path on HDFS specified when the job is run. The article says the JobClient then copies the resources (jars and computed input splits) to HDFS. Here is my question: when the input data is already in HDFS, why does the JobClient copy the computed input splits into HDFS?

Let's assume the JobClient copies the input splits to HDFS. When the job is submitted to the JobTracker and the JobTracker initializes the job, why does it retrieve the input splits from HDFS?

Apologies if my question is not clear. I am a beginner. :)

Answer

No, the JobClient does not copy the input data itself to HDFS; what it writes there is only split metadata. You have quoted the answer yourself:

Job Client computes input splits on the data located in the input path on the HDFS specified while running the job. The article says then Job Client copies the resources (jars and computed input splits) to the HDFS.

The input itself stays on the cluster. The client only computes on the meta information it gets from the namenode (block size, data length, block locations). These computed input splits carry that meta information to the tasks, e.g. the block offset and the length to compute on.
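To make the "only metadata" point concrete, here is a minimal, simplified sketch of how splits can be derived purely from namenode-style information (file length and block size). The class and method names below are made up for illustration; Hadoop's real logic lives in FileInputFormat.getSplits() and additionally honors min/max split sizes and block locations.

    import java.util.ArrayList;
    import java.util.List;

    public class SplitSketch {
        // Hypothetical value class mirroring what a FileSplit records.
        static class SplitInfo {
            final String path;
            final long start;
            final long length;
            SplitInfo(String path, long start, long length) {
                this.path = path; this.start = start; this.length = length;
            }
            @Override public String toString() {
                return path + " [offset=" + start + ", length=" + length + "]";
            }
        }

        // Chop a file into splits of at most blockSize bytes.
        // Note: no file data is read here, only lengths and offsets.
        static List<SplitInfo> computeSplits(String path, long fileLength, long blockSize) {
            List<SplitInfo> splits = new ArrayList<>();
            long offset = 0;
            while (offset < fileLength) {
                long length = Math.min(blockSize, fileLength - offset);
                splits.add(new SplitInfo(path, offset, length));
                offset += length;
            }
            return splits;
        }

        public static void main(String[] args) {
            // A 300 MB file with a 128 MB block size yields 3 splits.
            for (SplitInfo s : computeSplits("/data/input.txt", 300L << 20, 128L << 20)) {
                System.out.println(s);
            }
        }
    }

These tiny descriptors are what gets written to HDFS alongside the jars, which is why copying them is cheap even for very large input files.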

Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit: it contains the file path, the start offset, and the length of the chunk a single task will operate on as its input. The serializable class you may also want to look at is org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.
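To see what one of these descriptors actually holds, here is a short example using the real FileSplit class; the file path and host names are hypothetical.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileSplitDemo {
        public static void main(String[] args) throws Exception {
            // A split describing the first 128 MB of a (hypothetical) file,
            // preferably scheduled on the nodes that hold that block.
            FileSplit split = new FileSplit(
                    new Path("hdfs:///data/input.txt"),          // file path
                    0L,                                          // start offset in bytes
                    128L << 20,                                  // length (128 MB)
                    new String[] { "datanode1", "datanode2" });  // block locations

            System.out.println("path   = " + split.getPath());
            System.out.println("start  = " + split.getStart());
            System.out.println("length = " + split.getLength());
            // Only coordinates into HDFS are stored here, never the data itself.
        }
    }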

This meta information is computed for each task that will be run, and is copied together with the jars to the node that will actually execute the task.

