python nltk.sent_tokenize error: ascii codec can't decode
Problem description
I can successfully read text into a variable, but when I try to tokenize it I get this strange error:
sentences=nltk.sent_tokenize(sample)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
I know the cause of the error is some special string/character that the tokenizer isn't able to read/decode, but how can I bypass this? Thanks.
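This error typically means `sent_tokenize` received raw bytes rather than decoded text: byte `0xe2` begins the UTF-8 sequence for "smart" punctuation such as the right single quote (U+2019), which the ASCII codec cannot handle. A minimal sketch of the usual bypass, decoding the bytes explicitly before tokenizing (the sample string here is made up for illustration):

```python
# Byte 0xe2 starts the UTF-8 encoding of U+2019 (the "smart"
# apostrophe), which trips the default ASCII codec.
raw = b"It\xe2\x80\x99s a test. Here is another sentence."

# Decode explicitly before tokenizing; nltk.sent_tokenize(sample)
# then receives a unicode string instead of raw bytes.
sample = raw.decode("utf-8")
```

If the non-ASCII characters themselves are not needed, `raw.decode("ascii", errors="ignore")` would instead drop them, at the cost of losing those characters.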
Recommended answer
In a nutshell, NLTK 3's pos_tag function doesn't work. The NLTK 2 function works fine, however.
pip uninstall nltk
pip install http://pypi.python.org/packages/source/n/nltk/nltk-2.0.4.tar.gz
On the other hand, the tagger is pretty bad (apparently "conservatory" is a verb). I wish SpaCy worked on Windows.
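Separately, if the original UnicodeDecodeError comes from text read out of a file, another common workaround is to open the file with an explicit encoding so the variable already holds decoded text before it reaches the tokenizer. `io.open` behaves the same way on Python 2 and 3; the file written below is just a stand-in for the asker's data:

```python
import io
import os
import tempfile

# Write a sample file containing non-ASCII punctuation, standing in
# for the asker's input text.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"It\u2019s fine. Tokenize me.")

# Read it back with an explicit encoding instead of relying on the
# default ASCII codec; the result is already a unicode string and is
# safe to pass to nltk.sent_tokenize.
with io.open(path, "r", encoding="utf-8") as f:
    sample = f.read()
```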