How to debug hadoop mapreduce jobs from eclipse?


Problem description


I'm running hadoop in a single-machine, local-only setup, and I'm looking for a nice, painless way to debug mappers and reducers in eclipse. Eclipse has no problem running mapreduce tasks. However, when I go to debug, it gives me this error:

12/03/28 14:03:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

Okay, so I do some research. Apparently, I should use eclipse's remote debugging facility, and add this to my hadoop-env.sh:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000

I do that and I can step through my code in eclipse. Only problem is that, because of the "suspend=y", I can't use the "hadoop" command from the command line to do things like look at the job queue; it hangs, I'm imagining because it's waiting for a debugger to attach. Also, I can't run "hbase shell" when I'm in this mode, probably for the same reason.

So basically, if I want to flip back and forth between "debug mode" and "normal mode", I need to update hadoop-env.sh and restart my machine. Major pain. So I have a few questions:

  1. Is there an easier way to debug mapreduce jobs in eclipse?

  2. How come eclipse can run my mapreduce jobs just fine, but for debugging I need to use remote debugging?

  3. Is there a way to tell hadoop to use remote debugging for mapreduce jobs, but to operate in normal mode for all other tasks? (such as "hadoop queue" or "hbase shell").

  4. Is there an easier way to switch hadoop-env.sh configurations without rebooting my machine? hadoop-env.sh is not executable by default.

  5. This is a more general question : what exactly is happening when I run hadoop in local-only mode? Are there any processes on my machine that are "always on" and executing hadoop jobs? Or does hadoop only do things when I run the "hadoop" command from the command line? What is eclipse doing when I run a mapreduce job from eclipse? I had to reference hadoop-core in my pom.xml in order to make my project work. Is eclipse submitting jobs to my installed hadoop instance, or is it somehow running it all from the hadoop-core-1.0.0.jar in my maven cache?
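Regarding questions 1 and 4, one common workaround is to toggle the debug agent per invocation with an environment variable rather than editing hadoop-env.sh at all. This is only a sketch (the `HADOOP_DEBUG` variable is a made-up convention, and it assumes your hadoop-env.sh does not overwrite an already-exported `HADOOP_OPTS`):

```shell
# Sketch: enable the JDWP agent for a single run instead of editing
# hadoop-env.sh. suspend=y makes the JVM wait for a debugger to attach,
# so commands like "hadoop queue" or "hbase shell" should be run without it.
DEBUG_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000"
HADOOP_DEBUG="${HADOOP_DEBUG:-1}"   # export HADOOP_DEBUG=0 for a normal run
if [ "$HADOOP_DEBUG" = "1" ]; then
  HADOOP_OPTS="$DEBUG_OPTS"
else
  HADOOP_OPTS=""
fi
echo "HADOOP_OPTS=$HADOOP_OPTS"
```

You would then run something like `HADOOP_DEBUG=1 hadoop jar myjob.jar Main` for a debug session, and leave the variable unset (or 0) everywhere else, so no reboot or config edit is needed.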

Here is my Main class :

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
      public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("FirstStage");

        FileInputFormat.addInputPath(job, new Path("/home/sangfroid/project/in"));
        FileOutputFormat.setOutputPath(job, new Path("/home/sangfroid/project/out"));

        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
}

Solution

Make the change in the /bin/hadoop (hadoop-env.sh) script: check which command has been fired, and add the remote debug configuration only when the command is jar. That way jobs submitted with "hadoop jar" are debuggable, while all other commands run normally.

if [ "$COMMAND" = "jar" ] ; then
  exec "$JAVA" -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999 $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
else
  exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
fi
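Note that these flags omit `suspend=`, whose JDWP default is `y`, so the JVM will wait on port 8999 for Eclipse to attach before the job runs. A hypothetical refinement (the `HADOOP_DEBUG` variable is invented here, and `COMMAND` below stands in for the variable bin/hadoop already sets) is to gate the branch on an environment variable as well, so even "hadoop jar" starts normally unless you ask for debugging:

```shell
# Sketch: only inject the debug flags when the command is "jar" AND the
# caller has exported HADOOP_DEBUG=1; everything else takes the normal path.
COMMAND="jar"                       # stand-in for bin/hadoop's own $COMMAND
HADOOP_DEBUG="${HADOOP_DEBUG:-1}"
DEBUG_FLAGS=""
if [ "$COMMAND" = "jar" ] && [ "$HADOOP_DEBUG" = "1" ]; then
  DEBUG_FLAGS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999"
fi
echo "debug flags: $DEBUG_FLAGS"
```

In Eclipse you would then attach via a "Remote Java Application" debug configuration pointed at localhost:8999.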
