Reading file in hadoop streaming
Problem description
I am trying to read an auxiliary file in my mapper; here is my code and the command I run.
Mapper code:
```python
#!/usr/bin/env python
from itertools import combinations
from operator import itemgetter
import sys

storage = {}
with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

for line in sys.stdin:
    do_something()
```
And here is my command:
```shell
hadoop jar hadoop-streaming-2.7.1.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options='-k1,1 -k2,2' \
    -D mapred.map.tasks=20 \
    -D mapred.reduce.tasks=10 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -mapper mapper.py -file mapper.py \
    -reducer reducer.py -file reducer.py \
    -file inputData \
    -input /data \
    -output /result
```
But I keep getting this error, which indicates that my mapper fails to read from stdin. After deleting the file-reading part, my code works, so I have pinpointed where the error occurs, but I don't know the correct way to read the file. Can anyone help?
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
Recommended answer
The error you are getting means that your mapper failed to write to its stdout stream for too long.
For example, a common cause of this error is that your do_something() function contains a for loop with a continue statement under certain conditions. When that condition occurs too often in your input data, your script hits continue many times in a row without producing any output on stdout. Hadoop waits too long without seeing anything, so the task is considered failed.
Another possibility is that your input data file is too large and takes too long to read. But I think that would count as setup time, since it happens before the first line of output; I am not sure, though.
There are two relatively easy ways to solve this:
- (Developer side) Modify your code to output something every now and then. In the continue case, write a short dummy symbol like '\n' to let Hadoop know your script is still alive.
- (System side) I believe you can set the following parameter with the -D option, which controls the wait timeout in milliseconds: mapreduce.reduce.shuffle.read.timeout
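As an untested sketch of option 2, the flag could be passed like this. Note that mapreduce.task.timeout (the general per-task progress timeout) may be the more relevant knob for a mapper that stays silent too long, so it is shown as well; the 1800000 value (30 minutes) is an arbitrary example:

```shell
# Untested sketch: raise the timeouts (values in milliseconds) via -D.
# mapreduce.task.timeout is the general per-task progress timeout and may be
# the more relevant setting for a mapper that produces no output for a while.
hadoop jar hadoop-streaming-2.7.1.jar \
    -D mapreduce.task.timeout=1800000 \
    -D mapreduce.reduce.shuffle.read.timeout=1800000 \
    -mapper mapper.py -file mapper.py \
    -input /data \
    -output /result
```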
I have never tried option 2. Usually I avoid streaming over data that requires filtering; streaming, especially when done with a scripting language like Python, should do as little work as possible. My use cases are mostly post-processing output data from Apache Pig, where the filtering has already been done in the Pig scripts and I need something that is not available in Jython.
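Going back to option 1, here is a minimal sketch of a mapper loop that emits a heartbeat while skipping records. The should_skip predicate and the 10000-record interval are made up for illustration; note that Hadoop streaming also recognizes stderr lines of the form reporter:status:&lt;message&gt; as progress updates, which keeps the task alive without polluting the map output on stdout:

```python
import sys

def keep_alive():
    # Tell Hadoop the task is still making progress. A "reporter:status:"
    # line on stderr updates the task status without touching stdout.
    sys.stderr.write("reporter:status:still filtering\n")
    sys.stderr.flush()

def process(lines, should_skip):
    # Filter the input, sending a heartbeat while long runs of records
    # are being skipped.
    skipped = 0
    for line in lines:
        if should_skip(line):
            skipped += 1
            if skipped % 10000 == 0:  # hypothetical heartbeat interval
                keep_alive()
            continue
        sys.stdout.write(line)

# Demo on an in-memory sample; a real mapper would pass sys.stdin instead.
process(["# header\n", "a b\n"], lambda line: line.startswith("#"))
```

In the real mapper the call would be process(sys.stdin, ...), with should_skip replaced by whatever condition currently triggers the continue.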