In Hadoop where does the framework save the output of the Map task in a normal Map-Reduce Application?


Problem Description



I am trying to find out where the output of a Map task is saved to disk before it can be used by a Reduce task.

Note: the version used is Hadoop 0.20.204 with the new API.

For example, when overriding the map method in the Map class:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // word (a Text) and one (an IntWritable) are fields of the enclosing Mapper class
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }

    // code that starts a new Job.

}

I am interested in finding out where context.write() ends up writing the data. So far I've run into:

FileOutputFormat.getWorkOutputPath(context);

This gives me the following location on HDFS:

hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0

When I try to use it as input for another job, it gives me the following error:

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0

Note: the job is started in the Mapper, so technically the temporary folder where the Mapper task is writing its output exists when the new job begins. Even so, it still says that the input path does not exist.

Any ideas where the temporary output is written to? Or where I can find the output of a Map task during a job that has both a Map and a Reduce stage?

Solution

So, I've figured out what is really going on.

The output of the mapper is buffered until it reaches about 80% of the buffer's size; at that point it begins to dump the contents to local disk while continuing to accept items into the buffer.
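The buffering behavior described above can be sketched in plain Java. This is a toy model only, not Hadoop's actual MapOutputBuffer (which sorts and partitions records and writes spill files under the task's local directory); the 80% figure corresponds to the io.sort.spill.percent setting in 0.20.x, applied against a buffer sized by io.sort.mb:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the map-side output buffer: records accumulate in memory
// and, once the buffer crosses the spill threshold (80% of capacity by
// default), the current contents are dumped as a "spill" while new
// records keep being accepted.
public class SpillBufferSketch {
    private final int capacity;
    private final double threshold;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> spills = new ArrayList<>();

    public SpillBufferSketch(int capacity, double threshold) {
        this.capacity = capacity;
        this.threshold = threshold;
    }

    // Called for every record the mapper emits (the context.write analogue).
    public void collect(String record) {
        buffer.add(record);
        if (buffer.size() >= capacity * threshold) {
            spills.add(new ArrayList<>(buffer));  // dump current contents to "disk"
            buffer.clear();                       // keep admitting new records
        }
    }

    public int spillCount() { return spills.size(); }

    public static void main(String[] args) {
        SpillBufferSketch b = new SpillBufferSketch(10, 0.8);
        for (int i = 0; i < 20; i++) b.collect("rec" + i);
        System.out.println(b.spillCount());   // 2 spills of 8 records each
    }
}
```

In the real framework each spill is sorted per partition before being written, which matters for the merge step below.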

I wanted to get the intermediate output of the mapper and use it as input for another job while the mapper was still running. It turns out that this is not possible without heavily modifying the Hadoop 0.20.204 deployment. The way the system works, even after everything specified in the map lifecycle:

map .... {
  setup(context)
  .
  .
  cleanup(context)
}

and cleanup is called, there is still no dumping to the temporary folder.

After the whole Map computation, everything eventually gets merged and dumped to disk, becoming the input for the Shuffle and Sort stages that precede the Reducer.
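That final merge can be pictured as a k-way merge of the already-sorted spill runs. The sketch below is an illustration in plain Java, not Hadoop's actual Merger class, and it merges strings rather than key/value records:

```java
import java.util.ArrayList;
import java.util.List;

// Toy k-way merge of sorted spill runs: repeatedly pick the smallest
// head record across all runs, producing one fully sorted output --
// the single merged file the reducers then fetch from.
public class SpillMergeSketch {
    public static List<String> merge(List<List<String>> spills) {
        List<String> out = new ArrayList<>();
        int[] pos = new int[spills.size()];   // read cursor per spill run
        while (true) {
            int best = -1;
            for (int i = 0; i < spills.size(); i++) {
                if (pos[i] < spills.get(i).size()
                        && (best < 0 || spills.get(i).get(pos[i])
                                .compareTo(spills.get(best).get(pos[best])) < 0)) {
                    best = i;
                }
            }
            if (best < 0) break;              // every run exhausted
            out.add(spills.get(best).get(pos[best]++));
        }
        return out;
    }
}
```

The real merge also applies the combiner (if configured) and keeps records grouped by partition, but the ordering idea is the same.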

From everything I've read and looked at so far, the temporary folder where the output should eventually end up is the one I was guessing beforehand:

FileOutputFormat.getWorkOutputPath(context)
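One reason that _temporary/_attempt_* path later comes up missing: it only exists while the task attempt runs, and on task commit the output committer promotes its contents into the final output directory and removes it. The sketch below imitates that promotion step with plain java.nio.file, standing in for Hadoop's FileOutputCommitter (directory and file names are made up for illustration):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy sketch of task commit: files written under
// out/_temporary/_attempt_.../ are moved up into out/, and the
// attempt directory disappears afterwards.
public class CommitSketch {
    // Promote every file from the attempt directory into the final
    // output directory, then remove the now-empty attempt directory.
    public static void commit(Path outputDir, Path attemptDir) {
        try {
            try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
                for (Path f : files) {
                    Files.move(f, outputDir.resolve(f.getFileName()));
                }
            }
            Files.delete(attemptDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build a fake attempt dir, commit it, and report whether the part
    // file was promoted and the attempt path vanished.
    public static boolean demo() {
        try {
            Path out = Files.createTempDirectory("out");
            Path attempt = Files.createDirectories(
                    out.resolve("_temporary").resolve("_attempt_0001_m_000000_0"));
            Files.writeString(attempt.resolve("part-00000"), "word\t1\n");
            commit(out, attempt);
            return Files.exists(out.resolve("part-00000")) && !Files.exists(attempt);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // true: file promoted, attempt dir gone
    }
}
```

So a second job pointed at the attempt path can only ever see it during the narrow window before commit, which matches the InvalidInputException above.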

I managed to do what I wanted in a different way. Anyway, if there are any questions about this, let me know.
