Hadoop global variables with Hadoop Streaming


Question





I understand that I can give some global values to my mappers via the Job and the Configuration.

But how can I do that using Hadoop Streaming (Python, in my case)?

What is the right way?

Solution

Based on the docs, you can specify a command-line option (-cmdenv name=value) to set environment variables on each distributed machine, which you can then read in your mappers/reducers:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input input.txt \
    -output output.txt \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py \
    -cmdenv MY_PARAM=thing_I_need
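On the mapper side, a variable set with -cmdenv shows up as an ordinary environment variable, so it can be read with os.environ. Below is a minimal sketch of what mapper.py could look like; the word-count-style emit logic and the "default" fallback are illustrative assumptions, not part of the original question:

```python
#!/usr/bin/env python
import os
import sys


def map_line(line, my_param):
    """Return (word, param) pairs for one line of input."""
    return [(word, my_param) for word in line.split()]


def main():
    # Hadoop Streaming exports each -cmdenv name=value pair into the
    # task's environment, so os.environ is all that is needed here.
    my_param = os.environ.get("MY_PARAM", "default")
    for line in sys.stdin:
        for key, value in map_line(line, my_param):
            # Streaming expects tab-separated key/value pairs on stdout.
            print("%s\t%s" % (key, value))


if __name__ == "__main__":
    main()
```

Using os.environ.get with a fallback (rather than os.environ["MY_PARAM"]) keeps the script runnable locally for testing, e.g. `cat input.txt | python mapper.py`, even when the variable is not set.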

