Get the input file name in a Hadoop Streaming program


Question




When I write the program in Java, I can find the name of the input file in a mapper class using FileSplit.

Is there a corresponding way to do this when I write the program in Python (using streaming)?

I found the following in the Hadoop Streaming documentation on the Apache site:

See Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.

But I still can't understand how to make use of this inside my mapper.

Any help is highly appreciated.

Thanks

Solution

According to "Hadoop: The Definitive Guide":

Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
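Applied to the original question, the same mechanism should expose the input file name. The sketch below is a minimal example, not a definitive implementation. It assumes the property is `mapreduce.map.input.file` on newer Hadoop releases and `map.input.file` on older ones, which would appear as `mapreduce_map_input_file` and `map_input_file` in the environment; check which one your cluster actually sets. The helper names `env_name`, `input_file`, and `tag_lines`, and the demo path, are made up for illustration.

```python
import os
import re
import sys


def env_name(prop):
    """Approximate the Streaming rule: non-alphanumeric characters become '_'."""
    return re.sub(r"[^a-zA-Z0-9]", "_", prop)


def input_file(environ=None):
    """Return the input file path of the current split, or None if not set.

    Assumption: the property is mapreduce.map.input.file (newer Hadoop)
    or map.input.file (older Hadoop).
    """
    env = os.environ if environ is None else environ
    for prop in ("mapreduce.map.input.file", "map.input.file"):
        value = env.get(env_name(prop))
        if value:
            return value
    return None


def tag_lines(lines, out, environ=None):
    """Write each input line prefixed with the input file name and a tab."""
    name = input_file(environ) or "unknown"
    for line in lines:
        out.write("%s\t%s\n" % (name, line.rstrip("\n")))


if __name__ == "__main__":
    # Demo with a fake environment and a hypothetical path. In a real
    # mapper, call tag_lines(sys.stdin, sys.stdout) instead.
    fake_env = {"mapreduce_map_input_file": "hdfs://namenode/data/part-0.txt"}
    tag_lines(["a b\n", "c d\n"], sys.stdout, fake_env)
```

Checking both names keeps the mapper usable across Hadoop versions, and falling back to "unknown" avoids a KeyError when the script is run locally outside of Hadoop.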

You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:

-cmdenv MAGIC_PARAMETER=abracadabra
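On the Python side, a variable set with -cmdenv can be read like any other environment variable. A minimal sketch, assuming the name reaches the script unchanged because you chose it yourself; the helper `magic_parameter` and its default handling are illustrative only:

```python
import os


def magic_parameter(environ=None, default=None):
    """Read MAGIC_PARAMETER, returning `default` when it is not set."""
    env = os.environ if environ is None else environ
    return env.get("MAGIC_PARAMETER", default)


if __name__ == "__main__":
    # Prints the value passed via -cmdenv, or "not set" when run locally.
    print(magic_parameter(default="not set"))
```

Using `.get` with a default keeps the script testable outside of a Streaming job, where the variable is absent.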
