python nltk.sent_tokenize error ascii codec can't decode


Problem description

I could successfully read text into a variable, but while trying to tokenize the text I'm getting this strange error:

sentences=nltk.sent_tokenize(sample)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)

I do know the cause of the error is some special string/char which the tokenizer isn't able to read/decode, but how do I bypass this? Thanks
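
A common workaround in Python 2 (my own sketch, not part of the recommended answer below) is to decode the raw bytes into Unicode before handing them to the tokenizer; the file name sample.txt and the UTF-8 encoding are assumptions for illustration:

# Read the file as Unicode (assuming UTF-8) so sent_tokenize
# never receives raw bytes it would try to decode as ASCII.
import io
import nltk

with io.open('sample.txt', encoding='utf-8') as f:   # hypothetical file name
    sample = f.read()            # `sample` is now a unicode string

sentences = nltk.sent_tokenize(sample)
print(len(sentences))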

Recommended answer

In a nutshell, NLTK3's pos_tag function doesn't work.

The NLTK2 function works fine, however.

pip uninstall nltk

pip install http://pypi.python.org/packages/source/n/nltk/nltk-2.0.4.tar.gz

On the other hand, the tagger is pretty bad (apparently 'conservatory' is a verb). I wish SpaCy worked on Windows.
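
As a quick sanity check after the downgrade (my own illustration, not part of the original answer), something like the following confirms the installed version and exercises the tagger the answer complains about; the tokenizer/tagger data may need to be downloaded first:

import nltk

print(nltk.__version__)   # should report 2.0.4 after the reinstall above

# nltk.download('punkt') and nltk.download('maxent_treebank_pos_tagger')
# may be required on a fresh install before the calls below will work.
tokens = nltk.word_tokenize("I toured the conservatory yesterday.")
print(nltk.pos_tag(tokens))   # per the answer, 'conservatory' may be mis-tagged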
