Error when running a Python MapReduce job using Hadoop streaming in a Google Cloud Dataproc environment


Problem description

I want to run a Python MapReduce job in Google Cloud Dataproc using the Hadoop streaming method. My MapReduce Python scripts, input file, and job output are all located in Google Cloud Storage.

I tried to run this command:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
    -mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
    -file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat

But I got this error output:

File: /home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py does not exist, or is not readable.

Try -help for more information
Streaming Command Failed!

Is the Cloud Storage connector not working in Hadoop streaming? Is there any other way to run a Python MapReduce job using Hadoop streaming with the script and input file located in Google Cloud Storage?

Thank you

Recommended answer

The -file option in hadoop-streaming only works for local files. Note, however, that its help text mentions that the -file flag is deprecated in favor of the generic -files option. Using the generic -files option lets you specify a remote (HDFS/GCS) file to stage. Note also that generic options must precede application-specific flags.

Your invocation would then become:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -mapper mapper_prod_cat.py \
    -reducer reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat
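The staged scripts are referenced by their basenames, so they generally need a shebang line and the executable bit set (alternatively, the mapper and reducer can be invoked as "python mapper_prod_cat.py"). For reference, below is a minimal sketch of what such a streaming mapper and reducer pair might look like; the tab-separated column layout of purchases.txt is an assumption here, so adjust the field indices to your data.

#!/usr/bin/env python
# mapper_prod_cat.py -- hypothetical sketch of a Hadoop streaming mapper.
# Assumes purchases.txt is tab-separated with the product category in
# column 4 and the sale amount in column 5; adjust to the real layout.
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) == 6:
        category, amount = fields[3], fields[4]
        # Hadoop streaming expects "key<TAB>value" pairs on stdout.
        print('{0}\t{1}'.format(category, amount))

#!/usr/bin/env python
# reducer_prod_cat.py -- hypothetical sketch of a Hadoop streaming reducer.
# Input lines arrive sorted by key, so a running total can be flushed
# whenever the key changes.
import sys

current_key, total = None, 0.0

for line in sys.stdin:
    key, _, value = line.strip().partition('\t')
    if key != current_key:
        if current_key is not None:
            print('{0}\t{1}'.format(current_key, total))
        current_key, total = key, 0.0
    try:
        total += float(value)
    except ValueError:
        continue

if current_key is not None:
    print('{0}\t{1}'.format(current_key, total))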

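Since the cluster is managed by Dataproc, the same streaming job can also be submitted from outside the cluster with the gcloud CLI instead of SSHing in; this is a sketch where the cluster name and region are placeholders for your own values:

gcloud dataproc jobs submit hadoop \
    --cluster=your-cluster-name \
    --region=your-region \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -- \
    -files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
    -mapper mapper_prod_cat.py \
    -reducer reducer_prod_cat.py \
    -input gs://bucket-name/intro_to_mapreduce/purchases.txt \
    -output gs://bucket-name/intro_to_mapreduce/output_prod_cat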
