Stopword removal with NLTK
Problem description
I am trying to process user-entered text by removing stopwords with the NLTK toolkit, but stopword removal also removes words like 'and', 'or', and 'not'. I want these words to remain after stopword removal, because they are operators needed later when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.
Recommended answer
I suggest you create your own list of operator words and take them out of the stopword list. Sets can be conveniently subtracted, so:
operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators
Then you can simply test whether a word is in or not in the set, without relying on whether your operators are part of the stopword list. This also lets you switch to another stopword list or add an operator later.
if word.lower() not in stop:
    pass  # use word here
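To make the idea concrete, here is a minimal sketch of the whole flow. It is an illustration under stated assumptions, not the definitive approach: BASE_STOPWORDS is a small hand-written stand-in for a real stopword list, and the names OPERATORS, STOP, and filter_query are hypothetical. With NLTK installed and its stopwords corpus downloaded, you would presumably build the base set from stopwords.words('english') instead.

```python
# Stand-in stopword list (an assumption for this sketch). With NLTK available,
# this could instead be set(stopwords.words('english')) from nltk.corpus.
BASE_STOPWORDS = {
    "a", "an", "the", "is", "are", "of", "to", "in",
    "and", "or", "not", "for", "that",
}

# Words that must survive filtering because they act as query operators.
OPERATORS = {"and", "or", "not"}

# Set subtraction removes the operators from the stopword set.
STOP = BASE_STOPWORDS - OPERATORS


def filter_query(text):
    """Drop stopwords from text but keep the query operators."""
    return [word for word in text.split() if word.lower() not in STOP]


print(filter_query("Show the cats and dogs that are not in the house"))
# → ['Show', 'cats', 'and', 'dogs', 'not', 'house']
```

Because the operators are kept in their own set, adding another operator later only means extending OPERATORS; the filtering code stays the same.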