Part-of-speech tag without context using nltk
Question
Is there an easy way to determine the most likely part-of-speech tag for a given word, without context, using nltk? Or, if not, using any other tool/dataset?
I tried to use WordNet, but it seems that the synsets are not ordered by likelihood.
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('says')
[Synset('say.n.01'), Synset('state.v.01'), ...]
Answer
If you want to tag without context, you are looking for some sort of unigram tagger, also known as a lookup tagger. A unigram tagger tags a word solely based on the frequency of each tag given that word, so it avoids context heuristics. However, as with any tagging task, you must have data: a unigram tagger needs annotated data to train on. See the lookup tagger section in the NLTK tutorial: http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
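The core idea is small enough to sketch without NLTK: count (word, tag) pairs in annotated data, then always emit each word's most frequent tag. A minimal pure-Python sketch (the tiny training set here is invented for illustration):

```python
from collections import Counter, defaultdict

def train_lookup_tagger(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(lookup, words):
    """Tag words seen in training; unseen words get None."""
    return [(w, lookup.get(w)) for w in words]

# Toy annotated data (invented for illustration)
train = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('bark', 'NN'), ('is', 'BEZ'), ('rough', 'JJ')],
]
lookup = train_lookup_tagger(train)
print(tag(lookup, 'the dog barks .'.split()))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('.', None)]
```

This is all a lookup tagger does; NLTK's UnigramTagger adds corpus handling and backoff on top of the same idea.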
Here's an example of training a unigram tagger on the Brown corpus in NLTK:
>>> from nltk.corpus import brown
>>> from nltk import UnigramTagger as ut
>>> brown_sents = brown.tagged_sents()
# Split the data into train and test sets.
>>> train = int(len(brown_sents)*90/100) # use 90% for training
# Train the tagger
>>> uni_tag = ut(brown_sents[:train]) # this will take some time, ~1-2 mins
# Tag a sample sentence
>>> uni_tag.tag("this is a foo bar sentence .".split())
[('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('foo', None), ('bar', 'NN'), ('sentence', 'NN'), ('.', '.')]
# Test the tagger's accuracy.
>>> uni_tag.evaluate(brown_sents[train+1:]) # evaluate on the held-out 10%, also ~1-2 mins
0.8851469586629643
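Note the ('foo', None) pair above: a unigram tagger returns None for any word it never saw in training. NLTK taggers accept a backoff tagger for exactly this case; chaining a DefaultTagger makes unknown words fall back to a fixed tag ('NN' is a common choice, since nouns are the most frequent open class). A small sketch with a hand-made training set so it runs instantly (the training data and tags are illustrative, not from Brown):

```python
from nltk import DefaultTagger, UnigramTagger

# Tiny hand-made training data, just for illustration
train_sents = [
    [('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('sentence', 'NN'), ('.', '.')],
]

# Unknown words fall back to the DefaultTagger instead of None
uni_tag = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(uni_tag.tag('this is a foo bar sentence .'.split()))
# [('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN'), ('.', '.')]
```

The same backoff= keyword works when training on the Brown corpus as above, and backoffs can be chained (e.g. bigram → unigram → default).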
I wouldn't recommend using WordNet for POS tagging, because there are just so many words that still have no entry in WordNet. But you can take a look at using lemma frequencies in WordNet; see How to get the wordnet sense frequency of a synset in NLTK?. These frequencies are based on the SemCor corpus (http://www.cse.unt.edu/~rada/downloads.html).