Reading file in hadoop streaming


Question

I am trying to read an auxiliary file in my mapper; here is my code and the command I use.

Mapper code:

#!/usr/bin/env python

from itertools import combinations
from operator import itemgetter
import sys

storage = {}

with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

for line in sys.stdin:
    do_something()

Here is my command:

hadoop jar hadoop-streaming-2.7.1.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result

But I keep getting the error below, which indicates that my mapper fails to read from stdin. After deleting the file-reading part, my code works, so I have pinpointed where the error occurs, but I don't know the correct way to read the auxiliary file. Can anyone help?

 Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():

Answer

The error you are getting means your mapper failed to write to its stdout stream for too long.

For example, a common cause is that your do_something() function contains a loop with a continue statement guarded by some condition. When that condition holds too often in your input data, the script hits continue many times in a row without writing anything to stdout. Hadoop waits too long without seeing any output, so the task is considered failed.
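
As a rough illustration (the filter condition and field names here are hypothetical, not taken from the question), a mapper loop like this writes nothing to stdout while lines are being skipped:

import sys

storage = {}   # assume this was filled from the auxiliary file, as in the question's mapper

for line in sys.stdin:
    first, second = line.split()
    if (first, second) not in storage:
        continue    # long runs of skipped lines produce no stdout output, so the task eventually times out
    sys.stdout.write('%s\t%s\t%d\n' % (first, second, storage[(first, second)]))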

Another possibility is that your auxiliary input file is too large and reading it takes too long. But I think that counts as setup time, since it happens before the first line of output; I am not sure though.

There are two relatively easy ways to solve this:

  1. (developer side) Modify your code to output something every now and then. In the continue case, write a short dummy symbol like '\n' to let Hadoop know your script is still alive (see the sketch after this list).

  2. (system side) I believe you can set the following parameter with the -D option, which controls the timeout in milliseconds (see the example command after this list):

mapreduce.reduce.shuffle.read.timeout
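
A minimal sketch of option 1, building on the hypothetical loop shown earlier: whenever a line is skipped, emit a bare newline so the stream never goes quiet. Note that these empty lines end up in the map output, so the reducer has to tolerate them.

for line in sys.stdin:
    first, second = line.split()
    if (first, second) not in storage:
        sys.stdout.write('\n')    # dummy output so Hadoop knows the script is still alive
        continue
    sys.stdout.write('%s\t%s\t%d\n' % (first, second, storage[(first, second)]))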
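
For option 2, the parameter would be passed the same way as the other -D settings in the question's command. The value below (10 minutes, in milliseconds) is only an illustrative guess, not a recommended setting:

hadoop jar hadoop-streaming-2.7.1.jar \
-D mapreduce.reduce.shuffle.read.timeout=600000 \
(remaining options as in the question's command)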

I have never tried option 2. Usually I would avoid streaming on data that requires filtering. Streaming, especially when done with a scripting language like Python, should do as little work as possible. My use cases are mostly post-processing output data from Apache Pig, where the filtering has already been done in the Pig scripts and I need something that is not available in Jython.
