How to add an external jar to a Hadoop job?


Problem description

    I have a Hadoop job in which the mapper must use an external jar.

    I tried to pass this jar to the mapper's JVM:

    via the -libjars argument on the hadoop command

    hadoop jar mrrunner.jar DAGMRRunner -libjars <path_to_jar>/colt.jar
    

    via job.addFileToClassPath

    job.addFileToClassPath(new Path("<path_to_jar>/colt.jar"));
    

    on HADOOP_CLASSPATH.

    g1mihai@hydra:/home/g1mihai/$ echo $HADOOP_CLASSPATH
    <path_to_jar>/colt.jar
    

    None of these methods work. This is the stack trace I get back. The missing class it complains about, SparseDoubleMatrix1D, is in colt.jar.

    Let me know if I should provide any additional debug info. Thanks.

    15/02/14 16:47:51 INFO mapred.MapTask: Starting flush of map output
    15/02/14 16:47:51 INFO mapred.LocalJobRunner: map task executor complete.
    15/02/14 16:47:51 WARN mapred.LocalJobRunner: job_local368086771_0001
    java.lang.Exception: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
            at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
    Caused by: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
            at java.lang.Class.getDeclaredFields0(Native Method)
            at java.lang.Class.privateGetDeclaredFields(Class.java:2499)
            at java.lang.Class.getDeclaredField(Class.java:1951)
            at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
            at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
            at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
            at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
            at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
            at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
            at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
            at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
            at BoostConnector.ConnectCalculateBoost(BoostConnector.java:39)
            at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:46)
            at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:22)
            at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
            at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.ClassNotFoundException: cern.colt.matrix.impl.SparseDoubleMatrix1D
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            ... 28 more
    

Solution

    I believe this question deserves a detailed answer; I was stuck on this yesterday and wasted a lot of time. I hope this answer helps everyone who happens to run into the same problem. There are a couple of options to fix this issue:

    1. Include the external jar (the dependency JAR) as part of your application jar file. You can do this easily from Eclipse. The disadvantage is that it bloats your application jar, so your MapReduce job takes more time to ship and run, and every time the dependency version changes you have to rebuild the application. It's better not to go this route.
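
    If you do want to go this way without Eclipse, here is a rough command-line sketch (build/ is just an illustrative staging directory; mrrunner.jar and <path_to_jar> are taken from the question). Hadoop also adds any jars found under a lib/ directory inside the job jar to the task classpath:

      # Bundle the dependency under lib/ inside the job jar
      mkdir -p build/lib
      cp <path_to_jar>/colt.jar build/lib/
      (cd build && jar uf ../mrrunner.jar lib/colt.jar)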

    2. Use the Hadoop classpath: run the command "hadoop classpath" on the command line, find a suitable folder in its output, and copy your jar file to that location; Hadoop will pick up the dependency from there. This won't work on Cloudera and similar managed installations, as you may not have read/write permission to copy files into the Hadoop classpath folders.
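
    For example (the directories shown are only illustrative; run hadoop classpath on your own machine to see the real entries):

      $ hadoop classpath
      /etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:...
      $ cp <path_to_jar>/colt.jar /usr/lib/hadoop/lib/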

    3. The option I made use of was specifying -libjars with the hadoop jar command. First, make sure you edit your driver class:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class myDriverClass extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
          // ToolRunner parses the generic options (-libjars, -files, -D ...)
          // and hands the resulting Configuration to this Tool before run() is called.
          int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
          System.exit(res);
        }

        public int run(String[] args) throws Exception {

          // Configuration processed by ToolRunner
          Configuration conf = getConf();
          Job job = new Job(conf, "My Job");

          ...
          ...

          return job.waitForCompletion(true) ? 0 : 1;
        }
      }
      

    Now edit your "hadoop jar" command as shown below:

    hadoop jar YourApplication.jar myDriverClass -libjars path/to/jar/file args
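
    Note that the generic options such as -libjars have to appear before your application's own arguments; otherwise GenericOptionsParser will not see them. For the job from the question this would look something like the following, assuming DAGMRRunner has been converted to implement Tool as above (the trailing input/output arguments are just placeholders):

      hadoop jar mrrunner.jar DAGMRRunner -libjars <path_to_jar>/colt.jar <input_path> <output_path>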
    

    Now let's understand what happens underneath. Basically, we handle the generic command-line arguments by implementing the Tool interface. ToolRunner is used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic Hadoop command-line arguments and modify the Configuration of the Tool.

    Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args) - this runs the given Tool via Tool.run(String[]), after parsing the given generic arguments. It uses the given Configuration, or builds one if it is null, and then sets the Tool's configuration to the possibly modified version of that conf.
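
    To make that concrete, here is a simplified, illustrative sketch of what ToolRunner.run() effectively does (this is not the actual Hadoop source, and ToolRunnerSketch is a made-up class name):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.util.GenericOptionsParser;

      public class ToolRunnerSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // GenericOptionsParser consumes -libjars, -files, -D ... and applies
          // them to conf; only the remaining application arguments are returned.
          GenericOptionsParser parser = new GenericOptionsParser(conf, args);
          String[] toolArgs = parser.getRemainingArgs();

          myDriverClass tool = new myDriverClass();
          tool.setConf(conf);   // this conf is what getConf() returns inside run()
          System.exit(tool.run(toolArgs));
        }
      }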

    Now, within the run method, when we call getConf() we get that modified version of the Configuration. So make sure you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), none of this will work; the -libjars setting never reaches the job.

    Configuration conf = getConf();
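
    By contrast, this is the pattern that silently breaks -libjars (shown only to illustrate the mistake):

      public int run(String[] args) throws Exception {
        // Wrong: a fresh Configuration knows nothing about the -libjars value
        // that ToolRunner parsed, so the external jar is never shipped with the job.
        Configuration conf = new Configuration();
        ...
      }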
    
