How to add an external jar to a Hadoop job?


Problem description

I have a Hadoop job in which the mapper must use an external jar.

I tried to pass this jar to the mapper's JVM

via the -libjars argument on the hadoop command

hadoop jar mrrunner.jar DAGMRRunner -libjars <path_to_jar>/colt.jar

via job.addFileToClassPath

job.addFileToClassPath(new Path("<path_to_jar>/colt.jar"));

on HADOOP_CLASSPATH.

g1mihai@hydra:/home/g1mihai/$ echo $HADOOP_CLASSPATH
<path_to_jar>/colt.jar

None of these methods works. This is the stack trace I get back. The missing class it complains about, SparseDoubleMatrix1D, is in colt.jar.

Let me know if I should provide any additional debug info. Thanks.

15/02/14 16:47:51 INFO mapred.MapTask: Starting flush of map output
15/02/14 16:47:51 INFO mapred.LocalJobRunner: map task executor complete.
15/02/14 16:47:51 WARN mapred.LocalJobRunner: job_local368086771_0001
java.lang.Exception: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
        at java.lang.Class.getDeclaredFields0(Native Method)
        at java.lang.Class.privateGetDeclaredFields(Class.java:2499)
        at java.lang.Class.getDeclaredField(Class.java:1951)
        at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
        at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
        at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
        at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at BoostConnector.ConnectCalculateBoost(BoostConnector.java:39)
        at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:46)
        at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:22)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: cern.colt.matrix.impl.SparseDoubleMatrix1D
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 28 more
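A quick way to confirm whether a class is actually visible to a given JVM, independent of Hadoop, is a small stand-alone check (the class and method names here are just illustrative, not from the original post):

```java
public class ClasspathCheck {

    /** Returns true if the named class can be loaded by this JVM's classloader. */
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class from the stack trace above.
        String name = args.length > 0 ? args[0]
                : "cern.colt.matrix.impl.SparseDoubleMatrix1D";
        System.out.println(name + (isLoadable(name)
                ? " is on the classpath" : " is NOT on the classpath"));
    }
}
```

Running this with the same classpath as the failing task shows whether the jar is being picked up at all.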

Answer

I believe this question deserves a detailed answer. I was stuck on this yesterday and wasted a lot of time, and I hope this answer helps everyone who happens to run into the same thing. There are a couple of options to fix this issue:

  1. Include the external jar (dependency JAR) as part of your application jar file. You can do this easily using Eclipse. The disadvantage of this option is that it bloats your application jar and your MapReduce job will take more time to ship and launch, and every time the dependency version changes you have to recompile the application. It's better not to go this route.
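If you do decide to bundle the dependency, one common way to do it outside Eclipse is the Maven Shade plugin, which merges dependency classes into a single "uber jar" at package time. A minimal sketch, assuming a standard Maven project layout (the version number is an illustrative placeholder):

```xml
<!-- pom.xml fragment: produces an uber jar that bundles colt's classes -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.4.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```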

  2. Use "hadoop classpath": run the command "hadoop classpath" on the command line, find a suitable folder on that path, and copy your jar file to that location; Hadoop will pick up the dependency from there. This won't work on Cloudera and similar managed distributions, as you may not have read/write permission to copy files into the Hadoop classpath folders.

  3. The option I made use of was specifying -libjars with the hadoop jar command. First make sure that you edit your driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class myDriverClass extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
     int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
     System.exit(res);
  }

  public int run(String[] args) throws Exception {

    // Configuration processed by ToolRunner, so -libjars has been applied
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "My Job");

    ...

    return job.waitForCompletion(true) ? 0 : 1;
  }
}

Now edit your "hadoop jar" command as shown below:

hadoop jar YourApplication.jar [myDriverClass] -libjars path/to/jar/file args

Note that -libjars, like all generic options, must appear before your application's own arguments, or GenericOptionsParser will not pick it up.

Now let's understand what happens underneath. Basically we are handling the new command line arguments by implementing the Tool interface. ToolRunner is used to run classes that implement the Tool interface. It works in conjunction with GenericOptionsParser, which parses the generic Hadoop command line arguments (such as -libjars) and modifies the Configuration of the Tool.

Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args). This runs the given Tool by calling Tool.run(String[]) after parsing the generic arguments. It uses the given Configuration, or builds a new one if it is null, and then sets the Tool's configuration to the possibly modified version.

Now within the run method, when we call getConf() we get that modified version of the Configuration. So make sure that you have the line below in your code. If you implement everything else but still use Configuration conf = new Configuration(), nothing will work.

Configuration conf = getConf();

