Managing dependencies with Hadoop Streaming?


Question


I have a quick Hadoop Streaming question. If I'm using Python streaming and my mappers/reducers require Python packages that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?

Answer


If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zip file, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar \
    -mapper mapper.py -reducer reducer.py \
    -input input/foo -output output \
    -file /tmp/foo.py -file /tmp/lib.zip


However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596
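On the task side, the mapper has to make the shipped zip importable before using anything inside it; Python can import pure-Python modules directly from a zip placed on sys.path. A minimal word-count-style mapper.py sketch (the lib.zip name matches the invocation above; the function name is illustrative):

```python
#!/usr/bin/env python
import sys

# -file drops lib.zip into the task's working directory; prepending it
# to sys.path lets the mapper import pure-Python modules shipped inside.
sys.path.insert(0, 'lib.zip')

def map_words(stream):
    """Emit (word, 1) pairs for each whitespace-separated token."""
    for line in stream:
        for word in line.split():
            yield word, 1

if __name__ == '__main__':
    # Standard streaming contract: read records from stdin,
    # write tab-separated key/value pairs to stdout.
    for word, count in map_words(sys.stdin):
        sys.stdout.write('%s\t%d\n' % (word, count))
```

Note this trick only covers pure-Python code; packages with compiled extensions cannot be imported from a zip and would still need to be installed on (or shipped to and unpacked on) the task nodes.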

