Why are these words considered stopwords?


Problem description

I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side could shed some light on this. I am playing around with the NLTK library, and I was specifically looking into the stopwords function provided by this package:

In [80]: nltk.corpus.stopwords.words('english')

Out[80]:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

What I don't understand is, why is the word "not" present? Isn't that necessary to determine the sentiment inside a sentence? For instance, a sentence like this:

I am not sure what the problem is.

is totally different once the stopword "not" is removed, which changes the meaning of the sentence to its opposite ("I am sure what the problem is"). If that is the case, is there a set of rules that I am missing about when not to use these stopwords?

Solution

The concept of a stop word list has no universal definition; it depends on what you want to do. If your task requires understanding the polarity, sentiment, or a similar characteristic of a phrase, and your method depends on detecting negation (as in your example), then obviously you shouldn't remove "not" as a stop word. Note that you may still want to remove other very common, unrelated words, which would then constitute your new stop word list.
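A task-specific list like the one described above can be built by subtracting the words you need from the default list. The following is only a minimal sketch: `DEFAULT_STOPWORDS` is a hard-coded subset of the list quoted in the question (so that it runs without the NLTK corpus), and the `NEGATIONS` set and the `remove_stopwords` helper are assumptions made for illustration.

```python
import re

# Hard-coded subset of the NLTK English list quoted in the question, so the
# sketch runs without downloading the corpus.
DEFAULT_STOPWORDS = {"i", "am", "is", "the", "what", "a", "an", "of", "no", "nor", "not"}

# Assumption: the task only needs these negation words to be kept.
NEGATIONS = {"no", "nor", "not"}


def remove_stopwords(text, stopwords):
    """Lowercase, split into word tokens, and drop anything in `stopwords`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


sentence = "I am not sure what the problem is."
default_result = remove_stopwords(sentence, DEFAULT_STOPWORDS)
custom_result = remove_stopwords(sentence, DEFAULT_STOPWORDS - NEGATIONS)

print(default_result)  # ['sure', 'problem']
print(custom_result)   # ['not', 'sure', 'problem']
```

With NLTK and its stopwords corpus installed, `set(nltk.corpus.stopwords.words('english'))` could presumably stand in for the hard-coded set.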

However, to answer your question: most sentiment analysis methods are very superficial. They look for emotion- or sentiment-laden words and, most of the time, do not attempt a deep analysis of the sentence.
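To make "superficial" concrete, here is a rough sketch of a lexicon-based scorer of the kind described. The `LEXICON` dictionary and the `lexicon_score` function are hypothetical toy constructs, not any particular library's method. Because such a scorer only sums the polarity of words it recognizes, it gives the same score whether or not "not" is present.

```python
import re

# Hypothetical toy lexicon; real sentiment lexicons are far larger.
LEXICON = {"happy": 1, "great": 1, "good": 1, "sad": -1, "terrible": -1, "bad": -1}


def lexicon_score(text):
    """Sum the polarity of every token found in the lexicon; ignore the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)


with_not = lexicon_score("I am not happy with this")
without_not = lexicon_score("I am happy with this")

print(with_not, without_not)  # 1 1
```

A method like this loses nothing when "not" is dropped as a stop word, since it never looked at negation in the first place.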

As another example where you would want to keep the stop words: if you are trying to classify documents by author (authorship attribution) or carrying out stylometry, you should definitely keep these function words, as they characterize a large part of the style and the discourse.
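As a rough illustration of why these words matter for style, the sketch below computes a relative-frequency profile over a handful of function words. The `FUNCTION_WORDS` list, the sample texts, and the `function_word_profile` helper are all assumptions chosen for the example; actual stylometric work would use a much larger word list and far longer texts.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative set of function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "but", "not"]


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}


text_a = "The end of the road and the start of the path"
text_b = "Not to go but to stay and not to ask"

profile_a = function_word_profile(text_a)
profile_b = function_word_profile(text_b)

print(profile_a)
print(profile_b)
```

The two profiles differ sharply ("the" dominates the first text, "to" and "not" the second), and that signal would disappear entirely if the stop words were removed first.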

However, for many other kinds of analysis (e.g. word space models, document similarity, search) removing very common function words makes sense both computationally (you process fewer words) and, in some cases, practically (you may even get better results with the stop words removed). If I'm trying to understand the contexts in which a specific word is frequently used, I want to see the content words, not the function words.
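The document-similarity case can be sketched with a simple Jaccard overlap between token sets. Jaccard is used here only as a stand-in for whatever similarity measure you actually use, and the `STOPWORDS` set, the sample sentences, and both helper functions are assumptions for illustration. Two unrelated sentences look fairly similar when the function words are counted, and not at all similar once they are removed.

```python
import re

# Assumption: a tiny stop word set, enough for the two sample sentences.
STOPWORDS = {"the", "is", "on", "a", "of"}


def token_set(text, stopwords=frozenset()):
    """Return the set of lowercase word tokens not listed in `stopwords`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {tok for tok in tokens if tok not in stopwords}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "The cat is on the mat"
doc_b = "The stock is on the rise"

sim_all = jaccard(token_set(doc_a), token_set(doc_b))
sim_content = jaccard(token_set(doc_a, STOPWORDS), token_set(doc_b, STOPWORDS))

print(round(sim_all, 3))  # 0.429
print(sim_content)        # 0.0
```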
