Do we need to use Stopwords filtering before POS Tagging?

Problem description

I am new to text mining and NLP-related stuff. I am working on a small project where I am trying to extract information out of a few documents. I am basically doing POS tagging and then using a chunker to find patterns based on the tagged words. Do I need to remove stopwords before doing the POS tagging? Will using stopwords affect my POS tagger's accuracy?

Answer

Let's use the Spanish cess_esp corpus as an example to train/test a tagger:

First get the corpus and stoplist

>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('cess_esp')

Load the wrappers in NLTK

>>> from nltk.corpus import cess_esp as cess
>>> from nltk.corpus import stopwords

# Import the tagger classes used for training.
>>> from nltk import UnigramTagger, BigramTagger
# Load the Spanish stopwords
>>> stoplist = stopwords.words('spanish')
# Load the tagged sentences from the Spanish corpus
>>> cess_sents = cess.tagged_sents()

Split the corpus into train/test sets

>>> len(cess_sents)
6030
>>> test_set = cess_sents[-int(6030/10):]
>>> train_set = cess_sents[:-int(6030/10)]
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(10)[-2:]
[8, 9]
>>> range(10)[:-2]
[0, 1, 2, 3, 4, 5, 6, 7]

Create an alternate train_set without stopwords.

>>> train_set_nostop = [[(word,tag) for word, tag in sent if word.lower() not in stoplist] for sent in train_set]

See the difference:

>>> train_set[0]
[(u'El', u'da0ms0'), (u'grupo', u'ncms000'), (u'estatal', u'aq0cs0'), (u'Electricit\xe9_de_France', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EDF', u'np00000'), (u'-Fpt-', u'Fpt'), (u'anunci\xf3', u'vmis3s0'), (u'hoy', u'rg'), (u',', u'Fc'), (u'jueves', u'W'), (u',', u'Fc'), (u'la', u'da0fs0'), (u'compra', u'ncfs000'), (u'del', u'spcms'), (u'51_por_ciento', u'Zp'), (u'de', u'sps00'), (u'la', u'da0fs0'), (u'empresa', u'ncfs000'), (u'mexicana', u'aq0fs0'), (u'Electricidad_\xc1guila_de_Altamira', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EAA', u'np00000'), (u'-Fpt-', u'Fpt'), (u',', u'Fc'), (u'creada', u'aq0fsp'), (u'por', u'sps00'), (u'el', u'da0ms0'), (u'japon\xe9s', u'aq0ms0'), (u'Mitsubishi_Corporation', u'np00000'), (u'para', u'sps00'), (u'poner_en_marcha', u'vmn0000'), (u'una', u'di0fs0'), (u'central', u'ncfs000'), (u'de', u'sps00'), (u'gas', u'ncms000'), (u'de', u'sps00'), (u'495', u'Z'), (u'megavatios', u'ncmp000'), (u'.', u'Fp')]
>>> train_set_nostop[0]
[(u'grupo', u'ncms000'), (u'estatal', u'aq0cs0'), (u'Electricit\xe9_de_France', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EDF', u'np00000'), (u'-Fpt-', u'Fpt'), (u'anunci\xf3', u'vmis3s0'), (u'hoy', u'rg'), (u',', u'Fc'), (u'jueves', u'W'), (u',', u'Fc'), (u'compra', u'ncfs000'), (u'51_por_ciento', u'Zp'), (u'empresa', u'ncfs000'), (u'mexicana', u'aq0fs0'), (u'Electricidad_\xc1guila_de_Altamira', u'np00000'), (u'-Fpa-', u'Fpa'), (u'EAA', u'np00000'), (u'-Fpt-', u'Fpt'), (u',', u'Fc'), (u'creada', u'aq0fsp'), (u'japon\xe9s', u'aq0ms0'), (u'Mitsubishi_Corporation', u'np00000'), (u'poner_en_marcha', u'vmn0000'), (u'central', u'ncfs000'), (u'gas', u'ncms000'), (u'495', u'Z'), (u'megavatios', u'ncmp000'), (u'.', u'Fp')]

Train a tagger:

>>> uni_tag = UnigramTagger(train_set)

Train a tagger on the corpus without stopwords:

>>> uni_tag_nostop = UnigramTagger(train_set_nostop)

Split the test_set into words and tags:

>>> test_words, test_tags = zip(*[zip(*sent) for sent in test_set])

Tag the test sentences:

>>> tagged_sents = uni_tag.tag_sents(test_words)
>>> tagged_sents_nostop = uni_tag_nostop.tag_sents(test_words)

Evaluate the accuracy (let's just do true positives for now):

# Correct tags from the tagger trained WITH stopwords
>>> sum([ sum(1 for (word,pred_tag), (word, gold_tag) in zip(pred,gold) if pred_tag==gold_tag) for pred, gold in zip(tagged_sents, test_set)])
11266
# Correct tags from the tagger trained WITHOUT stopwords
>>> sum([ sum(1 for (word,pred_tag), (word, gold_tag) in zip(pred,gold) if pred_tag==gold_tag) for pred, gold in zip(tagged_sents_nostop, test_set)])
5963
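
For reference, NLTK taggers also expose a built-in scorer that re-tags a gold-standard corpus and returns accuracy as a fraction; a minimal sketch continuing the session above (on newer NLTK versions the method is named accuracy() instead of evaluate()):

# Accuracy of the tagger trained with stopwords kept
>>> uni_tag.evaluate(test_set)
# Accuracy of the tagger trained with stopwords removed
>>> uni_tag_nostop.evaluate(test_set)

Either way, the raw counts above (11266 vs. 5963 correct tags on the same test set) already show how much the stopword-free training data hurts.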

Note that several things make this comparison unfair when you remove the stopwords before training the tagger; non-exhaustively:

  • your training set will naturally be smaller, since each sentence contains fewer words after the stopwords are removed

  • the tagger never learns tags for the stopwords and will therefore return None for every stopword, which lowers your tagger's accuracy, since the test set does include stopwords

  • when training a higher-order ngram tagger, the sentences might not make any sense at all without the stopwords; not that grammaticality or sensibility accounts for accuracy (especially in today's NLP), but, e.g., "the cat is on the table" -> "cat table" without stopwords (see the sketch right after this list)
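
As a sketch of that last point, the BigramTagger imported earlier can be trained on both corpora, backing off to the unigram taggers built above (a common NLTK pattern); the variable names here are my own:

# Bigram taggers that fall back to the unigram taggers for unseen bigrams
>>> bi_tag = BigramTagger(train_set, backoff=uni_tag)
>>> bi_tag_nostop = BigramTagger(train_set_nostop, backoff=uni_tag_nostop)
# Tag the same natural test sentences with both and compare as before
>>> tagged_sents_bi = bi_tag.tag_sents(test_words)
>>> tagged_sents_bi_nostop = bi_tag_nostop.tag_sents(test_words)

The stopword-free model is trained on bigrams like "cat table" that never occur in natural test sentences, so it falls back to its (already weaker) unigram model far more often.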

But as @alexia pointed out, for bag-of-words based vector space models (a.k.a. distributional models, a.k.a. "you can know a word by its neighbors" models, a.k.a. non-neural predictive embedding models), removing the stopwords might bring you some mileage in terms of accuracy. As for TF-IDF, the (statistically) magical thing is that stopwords automatically get a low TF-IDF score: because they appear in most documents, they have little discriminatory power to make each document different (so they are not that important; it's the IDF part that does the magic).
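
A minimal sketch of that TF-IDF effect (this uses scikit-learn, which is my assumption and not part of the original answer; on scikit-learn versions before 1.0, use get_feature_names() instead of get_feature_names_out()):

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> docs = ["the cat is on the table",
...         "the dog is in the house",
...         "the cat chased the dog"]
>>> vec = TfidfVectorizer()
>>> vec.fit(docs)
# 'the' appears in every document, so its IDF (and hence its TF-IDF
# weight) is the minimum possible; rarer words score higher
>>> sorted(zip(vec.idf_, vec.get_feature_names_out()))

Here 'the' comes out with the smallest weight of the whole vocabulary, with no stoplist involved at all.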
