Hadoop and NLTK: Fails with stopwords


Problem description

I'm trying to run a Python program on Hadoop. The program involves the NLTK library. The program also utilizes the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords

#print stopwords.words('english')

for line in sys.stdin:
        print line,

reducer.py:

#!/usr/bin/env python

import sys
for line in sys.stdin:
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, with the output simply containing the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the program fails with:

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked that, in a standalone Python program,

print stopwords.words('english')

works perfectly fine, and so I am absolutely stumped as to why it's causing my Hadoop program to fail.

I would greatly appreciate any help! Thank you

Solution

Is 'english' a data file that stopwords.words('english') reads? If so, you need to ship it with -file as well so it is available on the task nodes.
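
A minimal sketch of one way to apply this, assuming the NLTK data directory is shipped alongside the job (the archive name nltk_data.zip and the #nltk_data link name are illustrative assumptions, not part of the original answer). When the corpus is missing on a task node, stopwords.words('english') typically raises a LookupError inside the mapper, which surfaces only as the generic failed-map-task error above. Pointing nltk.data.path at a shipped copy removes the dependency on per-node NLTK data installs:

#!/usr/bin/env python
# mapper.py -- sketch only; the paths below are assumptions.
import sys
import os
import nltk.data

# Assumption: an nltk_data directory containing corpora/stopwords was
# shipped with the job (e.g. via -archives nltk_data.zip#nltk_data) and
# is unpacked into the task's working directory.
nltk.data.path.insert(0, os.path.join(os.getcwd(), 'nltk_data'))

from nltk.corpus import stopwords

# Load the stopword list once, outside the input loop.
STOPWORDS = set(stopwords.words('english'))

for line in sys.stdin:
    # Drop English stopwords from each input line; sys.stdout.write is
    # used so the script runs under both Python 2 and 3.
    kept = [w for w in line.split() if w.lower() not in STOPWORDS]
    sys.stdout.write(' '.join(kept) + '\n')

The launch command would then carry the extra archive (again a sketch; exact flag behavior varies by Hadoop version, and generic options like -archives must precede the streaming-specific ones):

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -archives nltk_data.zip#nltk_data \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output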
