Hadoop global variable with streaming
Problem description
I understand that I can give some global value to my mappers via the Job and the Configuration. But how can I do that using Hadoop Streaming (Python in my case)? What is the right way?
Based on the docs, you can specify a command line option (-cmdenv name=value) to set environment variables on each distributed machine, which you can then use in your mappers/reducers:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input input.txt \
-output output.txt \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py \
-file reducer.py \
-cmdenv MY_PARAM=thing_I_need
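Inside the mapper, the variable set with -cmdenv is available through the process environment. A minimal sketch of what mapper.py might look like (the tab-separated output and the idea of tagging each record with the parameter are illustrative assumptions, not part of the original question):

```python
#!/usr/bin/env python
# mapper.py -- streaming mapper that reads MY_PARAM from the environment
import os
import sys

def map_line(line, param):
    """Illustrative map logic: tag each input line with the parameter value."""
    return "%s\t%s" % (line.strip(), param)

if __name__ == "__main__":
    # -cmdenv MY_PARAM=thing_I_need makes the value visible here
    my_param = os.environ.get("MY_PARAM", "default")
    for line in sys.stdin:
        # Hadoop Streaming treats text before the first tab as the key
        print(map_line(line, my_param))
```

The reducer can read the same variable the same way, since -cmdenv sets it for every task launched on the distributed machines.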