Stopword removal with NLTK


Question

I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but stopword removal also strips words like 'and', 'or', and 'not'. I want these words to survive stopword removal, because they are operators needed later when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.

Recommended answer

I suggest you create your own list of operator words that you take out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators
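
As a rough, self-contained sketch of the set subtraction above: the short `base_stopwords` list here is a hypothetical stand-in so the snippet runs without any downloads. With NLTK and its stopwords corpus installed, the list would typically come from `nltk.corpus.stopwords.words('english')` instead; that choice of language is an assumption, since the answer does not say which list to use.

```python
# Hypothetical, abbreviated stand-in for a real stopword list.
# With NLTK and its stopwords corpus available, one could use instead:
#   from nltk.corpus import stopwords
#   base_stopwords = stopwords.words('english')
base_stopwords = ['a', 'an', 'the', 'and', 'or', 'not', 'is', 'of', 'to', 'in']

# Words that must survive stopword removal because they act as query operators.
operators = {'and', 'or', 'not'}

# Set difference: every stopword except the operators.
stop = set(base_stopwords) - operators

print(sorted(stop))
```

Because `operators` is subtracted rather than removed item by item, an operator that is missing from the stopword list causes no error.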

Then you can simply test if a word is in or not in the set without relying on whether your operators are part of the stopword list. You can then later switch to another stopword list or add an operator.

if word.lower() not in stop:
    # use word
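
Putting the membership test to work on a whole query might look like the sketch below. The function name `remove_stopwords`, the sample query, and the abbreviated stopword list are all illustrative assumptions, and splitting on whitespace is a simplification; a real tokenizer such as `nltk.word_tokenize` may be a better fit for text with punctuation.

```python
def remove_stopwords(text, stop):
    """Return the words of `text` that are not in `stop` (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop]


# Hypothetical, abbreviated stand-in for a real stopword list.
base_stopwords = ['a', 'an', 'the', 'and', 'or', 'not', 'is', 'of', 'to', 'in']
operators = {'and', 'or', 'not'}
stop = set(base_stopwords) - operators

query = "cats and dogs or birds not in the house"
kept = remove_stopwords(query, stop)
print(kept)  # ['cats', 'and', 'dogs', 'or', 'birds', 'not', 'house']
```

The operators 'and', 'or', and 'not' remain in the result, while 'in' and 'the' are dropped.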
