STDIN or file as mapper input in Hadoop environment?

Question

Since we need to read in a bunch of files for the mapper, in a non-Hadoop environment I use os.walk(dir) and file = open(path, mode) to read each file.

However, in a Hadoop environment, I have read that Hadoop Streaming converts file input into the mapper's STDIN and converts the reducer's STDOUT into file output, so I have a few questions about how to handle file input:

  1. Do we have to read input from STDIN in mapper.py and let Hadoop Streaming convert the files in the HDFS input directory to STDIN?

  2. If I want to read each file separately and parse each line, how can I read input from a file in mapper.py?

My previous Python code for the non-Hadoop environment uses: for root, dirs, files in os.walk('path of non-hdfs') .....
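
For context, a minimal sketch of that non-Hadoop reading loop might look like the following (the directory path and the per-line handling are illustrative placeholders, not the actual code):

```python
import os

# Hypothetical local directory; the real path from the question is elided.
data_dir = '/home/hadoop/files'

def process(line):
    """Placeholder for whatever per-line parsing is needed."""
    print(line.rstrip('\n'))

# Walk the directory tree and read every file, line by line.
for root, dirs, files in os.walk(data_dir):
    for name in files:
        path = os.path.join(root, name)
        with open(path, 'r') as f:
            for line in f:
                process(line)
```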

However, in the Hadoop environment, I need to change 'path of non-hdfs' to the HDFS path that I copyFromLocal to, but I have tried many things with no success, such as os.walk('/user/hadoop/in') -- which is what I see when I run bin/hadoop dfs -ls -- os.walk('home/hadoop/files') -- which is my local path in the non-Hadoop environment -- and even os.walk('hdfs://host:fs_port/user/hadoop/in') ....

Can anyone tell me whether I can read input from a file by using file operations in mapper.py, or whether I have to read input from STDIN?

Thanks.

Answer

Hadoop Streaming has to take its input from STDIN. I think the confusion you're having is that you're trying to write code to do some of the things that Hadoop Streaming already does for you; I did that too when I first started Hadooping.

Hadoop Streaming can read in multiple files, and even multiple zipped files, which it then parses, one line at a time, into the STDIN of your mapper. This is a helpful abstraction because you can write your mapper to be independent of file names and locations, and then reuse your mappers and reducers for any input, which is handy later. Also, you don't want your mapper trying to grab files itself, because you have no way of knowing how many mappers you will have later. If files were hard-coded into the mapper and that single mapper failed, you would never get output from the files hard-coded in it. So let Hadoop do the file management and keep your code as generic as possible.
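
To make that concrete, a streaming mapper usually just consumes lines from STDIN and writes tab-separated key/value pairs to STDOUT, with no knowledge of which HDFS file each line came from. Here is a minimal word-count-style sketch (the tokenization is only an illustration, not something prescribed above):

```python
#!/usr/bin/env python
# mapper.py -- a minimal Hadoop Streaming mapper.
# Hadoop Streaming pipes the contents of every input file into this script,
# one line at a time, so it never needs to know file names or HDFS paths.
import sys

for line in sys.stdin:
    # Emit "word<TAB>1" for every whitespace-separated token on the line.
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))
```

You then point the streaming job's -input at the HDFS directory (for example /user/hadoop/in), and Hadoop handles splitting the files and feeding their lines to this script.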
