Hadoop and NLTK: Fails with stopwords


Problem description

I'm trying to run a Python program on Hadoop. The program involves the NLTK library. The program also utilizes the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
        print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, with the output simply containing the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the program fails with:

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked that, in a standalone Python program,

print stopwords.words('english')

works perfectly fine, and so I am absolutely stumped as to why it's causing my Hadoop program to fail.

I would greatly appreciate any help! Thank you

Solution

Is 'english' a data file that stopwords.words('english') reads? If so, you need to ship it with -file as well so it is available on the task nodes.
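
A minimal sketch of one way to apply this, assuming the NLTK data directory is shipped alongside the job (the archive name nltk_data.zip and the #nltk_data link name are illustrative assumptions, not part of the original answer). When the corpus is missing on a task node, stopwords.words('english') typically raises a LookupError inside the mapper, which surfaces only as the generic failed-map-task error above. Pointing nltk.data.path at a shipped copy removes the dependency on per-node NLTK data installs:

#!/usr/bin/env python
# mapper.py -- sketch only; the paths below are assumptions.
import sys
import os
import nltk.data

# Assumption: an nltk_data directory containing corpora/stopwords was
# shipped with the job (e.g. via -archives nltk_data.zip#nltk_data) and
# is unpacked into the task's working directory.
nltk.data.path.insert(0, os.path.join(os.getcwd(), 'nltk_data'))

from nltk.corpus import stopwords

# Load the stopword list once, outside the input loop.
STOPWORDS = set(stopwords.words('english'))

for line in sys.stdin:
    # Drop English stopwords from each input line; sys.stdout.write is
    # used so the script runs under both Python 2 and 3.
    kept = [w for w in line.split() if w.lower() not in STOPWORDS]
    sys.stdout.write(' '.join(kept) + '\n')

The launch command would then carry the extra archive (again a sketch; exact flag behavior varies by Hadoop version, and generic options like -archives must precede the streaming-specific ones):

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -archives nltk_data.zip#nltk_data \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output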
