Why are these words considered stopwords?


Question

I do not have a formal background in Natural Language Processing and was wondering if someone from the NLP side could shed some light on this. I am playing around with the NLTK library, and I was looking specifically at the stopwords list provided by this package:

In [80]: nltk.corpus.stopwords.words('english')

Out[80]:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

What I don't understand is, why is the word "not" present? Isn't that necessary to determine the sentiment inside a sentence? For instance, a sentence like this:

I am not sure what the problem is.

is totally different once the stopword "not" is removed: the meaning flips to its opposite ("I am sure what the problem is"). If that is the case, is there a set of rules I am missing about when not to use these stopwords?

Answer

The concept of a stop word list has no universal definition; it depends on what you want to do. If your task requires understanding the polarity, sentiment, or a similar characteristic of a phrase, and your method depends on detecting negation (as in your example), then obviously you shouldn't remove "not" as a stop word. Note that you may still want to remove other very common, unrelated words, which would then constitute your new stop word list.
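As a minimal sketch of that idea, the snippet below filters the sentence from the question twice: once with a stop word set that contains the negations, and once with the negations taken out. `BASE_STOPWORDS` is only a hand-picked subset of the list shown in the question, and `NEGATIONS` and `remove_stopwords` are hypothetical names introduced for illustration; the whitespace tokenizer is a simplification.

```python
# Illustrative subset of the English stop word list shown in the question
# (the real list is much longer).
BASE_STOPWORDS = {"i", "am", "not", "what", "the", "is", "no", "nor", "a", "an"}

# Hypothetical choice of negation words to keep for polarity-sensitive tasks.
NEGATIONS = {"no", "nor", "not"}


def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in stopwords]


sentence = "I am not sure what the problem is."

# Default list: the negation disappears along with the other stop words.
default_result = remove_stopwords(sentence, BASE_STOPWORDS)

# Task-specific list: same stop words minus the negations.
negation_aware_result = remove_stopwords(sentence, BASE_STOPWORDS - NEGATIONS)

print(default_result)
print(negation_aware_result)
```

With the default set the sentence reduces to "sure problem"; with the task-specific set the "not" survives, which is the behavior a negation-aware method would need.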

However, to answer your question: most sentiment analysis methods are very superficial. They look for emotion- or sentiment-laden words and, most of the time, do not attempt a deep analysis of the sentence.
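To make "superficial" concrete, here is a toy lexicon-based scorer of the kind described above. It is a sketch only: the `POSITIVE` and `NEGATIVE` word sets and the `lexicon_score` function are invented for illustration and do not come from any particular library.

```python
# Tiny, made-up sentiment lexicons (assumptions for illustration only).
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "terrible", "sad"}


def lexicon_score(text):
    """Count positive words minus negative words; ignores word order and negation."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    positives = sum(tok in POSITIVE for tok in tokens)
    negatives = sum(tok in NEGATIVE for tok in tokens)
    return positives - negatives


plain_score = lexicon_score("The movie was good.")
negated_score = lexicon_score("The movie was not good.")

print(plain_score, negated_score)
```

Both sentences receive the same score, because this kind of scorer never looks at "not" at all. For such a method it makes no difference whether "not" is in the stop word list.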

As another example where you would want to keep the stop words: if you are trying to classify documents according to their authors (authorship attribution) or carrying out stylometric analysis, you should definitely keep these function words, as they characterize a big part of the style and the discourse.
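A rough sketch of why function words matter here: a simple stylometric feature vector can be built from nothing but their relative frequencies. The `FUNCTION_WORDS` list and the `function_word_profile` helper below are hypothetical and far smaller than anything used in practice.

```python
from collections import Counter

# Hypothetical, very short list of function words used as style features.
FUNCTION_WORDS = ["the", "of", "and", "to", "but"]


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    tokens = [tok.strip(".,;!?").lower() for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in FUNCTION_WORDS]


profile = function_word_profile("The cat and the dog went to the park.")
print(dict(zip(FUNCTION_WORDS, profile)))
```

If these words were stripped as stop words first, every document would get an all-zero profile and the feature would carry no information about the author.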

However, for many other kinds of analysis (e.g. word space models, document similarity, search) removing very common function words makes sense both computationally (you process fewer words) and, in some cases, practically (you may even get better results with the stop words removed). If I'm trying to understand the contexts in which a specific word is used most often, I'd like to see the content words, not the function words.
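The last point can be sketched with a small co-occurrence count. The example documents, the `STOPWORDS` set, and the `context_words` helper are all made up for illustration; a real word space model would use a proper tokenizer and a context window rather than whole sentences.

```python
from collections import Counter

# Small illustrative stop word set (an assumption, not the full NLTK list).
STOPWORDS = {"the", "a", "of", "on", "is", "in", "and", "to"}


def context_words(texts, target, stopwords=frozenset()):
    """Count words that share a text with `target`, skipping any stop words."""
    counts = Counter()
    for text in texts:
        tokens = [tok.strip(".,!?").lower() for tok in text.split()]
        if target in tokens:
            counts.update(
                tok for tok in tokens
                if tok and tok != target and tok not in stopwords
            )
    return counts


docs = [
    "The bank raised the interest rate.",
    "The bank is on the river.",
    "A loan from the bank.",
]

raw = context_words(docs, "bank")
filtered = context_words(docs, "bank", STOPWORDS)

print(raw.most_common(3))
print(filtered.most_common(3))
```

Without filtering, the most frequent context word for "bank" is "the", which says nothing about how the word is used. With the stop words removed, only content words such as "interest", "river", and "loan" remain.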
