Hadoop and NLTK: Fails with stopwords
Question
I'm trying to run a Python program on Hadoop. The program involves the NLTK library. The program also utilizes the Hadoop Streaming API, as described here.
mapper.py:
#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
    print line,
reducer.py:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    print line,
Console command:
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py  -mapper  /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output
This runs perfectly, with the output simply containing the lines of the input file.
However, when this line (from mapper.py):
#print stopwords.words('english')
is uncommented, then the program fails and says
Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I have checked and in a standalone python program,
print stopwords.words('english')
works perfectly fine, and so I am absolutely stumped as to why it's causing my Hadoop program to fail.
I would greatly appreciate any help! Thank you
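(One way to get past the generic "failed Map Tasks" message is to catch the exception in the mapper and write it to stderr, which ends up in the task logs. A minimal sketch, assuming the failure is NLTK being unable to find its data on the worker nodes; the fallback word list is a placeholder, not part of the original question:)

```python
import sys

# Hypothetical fallback list, used only if the NLTK corpus cannot be
# loaded (e.g. nltk_data is not installed on the worker node).
FALLBACK_STOPWORDS = {'a', 'an', 'and', 'is', 'of', 'on', 'the', 'to'}

def load_stopwords():
    """Try the NLTK corpus; on failure, log the real exception to
    stderr (visible in the task logs) instead of letting the map
    task die with only the generic failure count."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words('english'))
    except Exception as exc:  # typically LookupError: nltk_data missing
        sys.stderr.write('stopwords unavailable: %r\n' % (exc,))
        return FALLBACK_STOPWORDS

def strip_stopwords(line, stops):
    """Drop stopword tokens from one input line (whitespace split)."""
    return ' '.join(w for w in line.split() if w.lower() not in stops)

# In the actual mapper this would be followed by:
#     stops = load_stopwords()
#     for line in sys.stdin:
#         print(strip_stopwords(line, stops))
```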
Solution

Is 'english' a file in

print stopwords.words('english')

? If yes, you need to use -file for that too to send it across the nodes.
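(A sketch of what that might look like, assuming the corpus file sits at the usual local path ~/nltk_data/corpora/stopwords/english; the path is an assumption, adjust it to the actual install location:)

```shell
# Hypothetical extension of the command above: ship the stopword list
# itself with -file so each task gets a local copy named "english".
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py  -mapper  /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -file $HOME/nltk_data/corpora/stopwords/english \
    -input /hadoop/input.txt -output /hadoop/output
```

The mapper would then read the shipped file directly (e.g. open('english').read().split()) rather than going through nltk.corpus, since NLTK itself still expects a full nltk_data directory on every node.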