Hadoop - submit a job with lots of dependencies (jar files)


Problem description



I want to write some sort of "bootstrap" class which will watch MQ for incoming messages and submit map/reduce jobs to Hadoop. These jobs use some external libraries heavily. For the moment I have an implementation of these jobs, packaged as a ZIP file with bin, lib, and log folders (I'm using maven-assembly-plugin to tie things together).

Now I want to provide small wrappers for Mapper and Reducer, which will use parts of the existing application.

As far as I understand, when a job is submitted, Hadoop tries to find the JAR file that contains the mapper/reducer classes and copies that jar over the network to the data nodes that will process the data. But it's not clear to me how I tell Hadoop to copy all the dependencies as well.

I could use maven-shade-plugin to create an uber-jar with the job and its dependencies, and another jar for the bootstrap (that jar would be executed with the hadoop shell script).
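For reference, a minimal sketch of what that maven-shade-plugin setup might look like in the job's pom.xml; the main class name is hypothetical:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <!-- build the uber-jar during the package phase -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- set the entry point so "hadoop jar" can run the jar directly -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.example.JobDriver</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>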

Please advise.

Solution

One way could be to put the required jars in the distributed cache. Another alternative would be to install all the required jars on the Hadoop nodes and tell the TaskTrackers about their location. I would suggest you go through this post once; it talks about the same issue.
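To make the distributed-cache route concrete, below is a hedged sketch of a driver that accepts extra jars through the standard -libjars generic option (parsed by ToolRunner) and also adds one jar to the task classpath explicitly. The class name and HDFS paths are hypothetical, and the API shown is the Hadoop 1.x mapreduce/DistributedCache API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects any jars passed with -libjars,
        // because ToolRunner ran GenericOptionsParser over the arguments.
        Job job = new Job(getConf(), "job-with-dependencies");
        job.setJarByClass(JobDriver.class); // the jar Hadoop ships to the task nodes

        // Explicit alternative: add a dependency jar that already sits in HDFS
        // to every task's classpath (hypothetical path).
        DistributedCache.addFileToClassPath(
                new Path("/libs/some-dependency.jar"), job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Example invocation (hypothetical jar names):
        //   hadoop jar bootstrap.jar JobDriver -libjars dep1.jar,dep2.jar in out
        System.exit(ToolRunner.run(new Configuration(), new JobDriver(), args));
    }
}

With -libjars, Hadoop copies the listed jars into the distributed cache and puts them on each task's classpath, so nothing needs to be pre-installed on the nodes.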
