Hadoop streaming with python on Windows


Problem description




I'm using Hortonworks HDP for Windows and have it successfully configured with a master and 2 slaves.

I'm using the following command:

bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reduce.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27
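
The question doesn't include the scripts themselves; for context, a minimal mapper.py/reducer.py pair of the kind Hadoop streaming drives over stdin/stdout might look like the word-count sketch below. This is purely illustrative (presumably Python 2.7, given the PYTHONPATH in the command), not the poster's actual code:

# mapper.py -- read raw input lines on stdin, emit tab-separated key/value pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t1' % word

# reducer.py -- streaming delivers input sorted by key, so counts can be summed per run
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print '%s\t%d' % (current_word, current_count)
    current_word, current_count = word, int(count)
if current_word is not None:
    print '%s\t%d' % (current_word, current_count)

Both scripts just read stdin and write stdout; streaming wires them to the input splits and to the sorted shuffle output respectively.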

The mapper runs through fine, but the log reports that the reduce.py file wasn't found. In the exception it looks like the Hadoop task runner is creating the reducer's symlink to the mapper.py file.

When I checked the job configuration file, I noticed that mapred.cache.files is set to:

hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py
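
For comparison, if both scripts had been picked up by the distributed cache, one would expect a comma-separated pair there. The second entry below is hypothetical, mirroring the mapper entry for the reducer.py shipped via -files:

hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/mapper.py#mapper.py,hdfs://MDAC-HD1:8020/mapred/staging/administrator/.staging/job_201304251054_0021/files/reducer.py#reducer.py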

It looks like, although the reduce.py file is being added to the job, it's not being included in the configuration correctly and can't be found when the reducer tries to run.

I think my command is correct. I've tried using -file parameters instead, but then neither file is found.

Can anyone see or know of an obvious reason?

Please note, this is on Windows.

EDIT: I've just run it locally and it worked. It looks like my problem may be with the copying of the files around the cluster.

Still welcome input!

Solution

Well, that's embarrassing... my first question and I answer it myself.

I found the problem by renaming the Hadoop conf file to force the default settings, which meant using the local job tracker.

The job ran properly, and that gave me the room to work out what the problem was; it looks like communication around the cluster isn't as complete as it needs to be.
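
For anyone who wants to force the same fallback explicitly rather than by renaming the conf file, the Hadoop 1.x equivalent is the local job runner setting in mapred-site.xml. This is a sketch of the standard setting, not something from the original post:

<configuration>
  <property>
    <!-- "local" runs MapReduce in-process via the LocalJobRunner instead of the cluster -->
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>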
