How to add an external jar to a Hadoop job?
Question
I have a Hadoop job in which the mapper must use an external jar.

I tried to pass this jar to the mapper's JVM:

via the -libjars argument on the hadoop command

hadoop jar mrrunner.jar DAGMRRunner -libjars <path_to_jar>/colt.jar

via job.addFileToClassPath

job.addFileToClassPath(new Path("<path_to_jar>/colt.jar"));

via HADOOP_CLASSPATH.

g1mihai@hydra:/home/g1mihai/$ echo $HADOOP_CLASSPATH
<path_to_jar>/colt.jar

None of these methods work. This is the stack trace I get back. The missing class it complains about, SparseDoubleMatrix1D, is in colt.jar.

Let me know if I should provide any additional debug info. Thanks.
15/02/14 16:47:51 INFO mapred.MapTask: Starting flush of map output
15/02/14 16:47:51 INFO mapred.LocalJobRunner: map task executor complete.
15/02/14 16:47:51 WARN mapred.LocalJobRunner: job_local368086771_0001
java.lang.Exception: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NoClassDefFoundError: Lcern/colt/matrix/impl/SparseDoubleMatrix1D;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2499)
at java.lang.Class.getDeclaredField(Class.java:1951)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at BoostConnector.ConnectCalculateBoost(BoostConnector.java:39)
at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:46)
at DAGMapReduceSearcher$Map.map(DAGMapReduceSearcher.java:22)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: cern.colt.matrix.impl.SparseDoubleMatrix1D
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 28 more
Solution

I believe that this question deserves a detailed answer; I was stuck on this yesterday and wasted a lot of time. I hope this answer helps everyone who happens to run into this. There are a couple of options to fix this issue:
1. Include the external jar (dependency JAR) as part of your application jar file. You can easily do this using Eclipse. The disadvantage of this option is that it will bloat your application jar, and your MapReduce job will take much more time to execute. Every time your dependency version changes, you will have to recompile the application, and so on. It's better not to go this route.

2. Use the Hadoop classpath: on the command line, run the command "hadoop classpath", find a suitable folder, and copy your jar file to that location; Hadoop will pick up the dependency from there. This won't work with Cloudera etc., as you may not have read/write rights to copy files to the hadoop classpath folders.

3. The option that I made use of was specifying -libjars with the "hadoop jar" command. First make sure that you edit your driver class:
public class myDriverClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new myDriverClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        Job job = new Job(conf, "My Job");
        ...
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Now edit your "hadoop jar" command as shown below (note that the generic options, such as -libjars, must come before your application's own arguments):

hadoop jar YourApplication.jar [myDriverClass] -libjars path/to/jar/file args
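A practical detail: -libjars takes a single comma-separated list of jar paths. The sketch below builds such a list from a directory of jars; the directory and jar names are placeholders made up for this demo.

```shell
# Create placeholder jars for the demo (hypothetical paths):
mkdir -p /tmp/libjars-demo
touch /tmp/libjars-demo/colt.jar /tmp/libjars-demo/extra.jar

# Join all jars in the directory into one comma-separated argument:
JARS=$(printf '%s\n' /tmp/libjars-demo/*.jar | paste -sd, -)
echo "$JARS"   # prints /tmp/libjars-demo/colt.jar,/tmp/libjars-demo/extra.jar

# The real invocation would then look like:
#   hadoop jar YourApplication.jar myDriverClass -libjars "$JARS" <app args>
```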
Now let's understand what happens underneath. Basically, we are handling the new command-line arguments by implementing the Tool interface. ToolRunner is used to run classes implementing the Tool interface. It works in conjunction with GenericOptionsParser to parse the generic hadoop command-line arguments and modify the Configuration of the Tool.

Within our main() we call ToolRunner.run(new Configuration(), new myDriverClass(), args) - this runs the given Tool by Tool.run(String[]), after parsing with the given generic arguments. It uses the given Configuration, or builds one if it's null, and then sets the Tool's configuration with the possibly modified version of the conf.

Now, within the run method, when we call getConf() we get the modified version of the Configuration. So make sure that you have the line below in your code. If you implement everything else and still use Configuration conf = new Configuration(), nothing will work.
Configuration conf = getConf();
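The contract described above can be illustrated without a cluster. The sketch below is a plain-JDK stand-in for Tool/ToolRunner (the class names mirror Hadoop's API but are simplified mocks, not the real Hadoop classes, and the "-libjars x.jar" parsing is simulated); it shows why getConf() inside run() sees the options that ToolRunner parsed, while a fresh new Configuration() would not:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-JDK sketch of the Tool/ToolRunner contract. These are simplified
// stand-ins that mirror Hadoop's API names, not the real Hadoop classes.
public class ToolRunnerSketch {

    static class Configuration {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    interface Tool {
        void setConf(Configuration conf);
        Configuration getConf();
        int run(String[] args) throws Exception;
    }

    // What ToolRunner.run effectively does: parse the generic options into
    // conf, hand that SAME conf to the tool, then invoke run() with the
    // remaining (application-specific) arguments.
    static int run(Configuration conf, Tool tool, String[] args) throws Exception {
        // Pretend "-libjars x.jar" was found among args; Hadoop records the
        // jars under the "tmpjars" property.
        conf.set("tmpjars", "x.jar");
        tool.setConf(conf);
        return tool.run(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        Tool driver = new Tool() {
            private Configuration conf;
            public void setConf(Configuration c) { this.conf = c; }
            public Configuration getConf() { return conf; }
            public int run(String[] appArgs) {
                // getConf() returns the conf that the runner modified;
                // "new Configuration()" here would know nothing about -libjars.
                System.out.println("tmpjars=" + getConf().get("tmpjars"));
                return 0;
            }
        };
        run(new Configuration(), driver, args);  // prints tmpjars=x.jar
    }
}
```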