Part-of-speech tag without context using nltk


Question

Is there an easy way to determine the most likely part-of-speech tag for a given word without context using nltk? Or, if not with nltk, with any other tool/dataset?

I tried to use WordNet, but it seems that the synsets are not ordered by likelihood.

>>> wn.synsets('says')

[Synset('say.n.01'), Synset('state.v.01'), ...]

Answer

If you want to try tagging without context, you are looking for some sort of unigram tagger, also known as a lookup tagger. A unigram tagger tags a word solely based on the frequency of each tag given that word, so it avoids contextual heuristics. However, for any tagging task you must have data, and for unigrams you need annotated data to train on. See the lookup tagger in the NLTK tutorial: http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html

Here's an example in NLTK:

>>> from nltk.corpus import brown
>>> from nltk import UnigramTagger as ut
>>> brown_sents = brown.tagged_sents()
# Split the data into train and test sets.
>>> train = int(len(brown_sents)*90/100) # use 90% for training
# Trains the tagger
>>> uni_tag = ut(brown_sents[:train]) # this will take some time, ~1-2 mins
# Tags a random sentence
>>> uni_tag.tag("this is a foo bar sentence .".split())
[('this', 'DT'), ('is', 'BEZ'), ('a', 'AT'), ('foo', None), ('bar', 'NN'), ('sentence', 'NN'), ('.', '.')]
# Test the taggers accuracy.
>>> uni_tag.evaluate(brown_sents[train+1:]) # evaluate on 10%, will also take ~1-2 mins
0.8851469586629643
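
Note that out-of-vocabulary words such as foo come back tagged None. A minimal extension (a sketch, not part of the original answer: it uses the standard backoff= parameter of NLTK's sequential taggers, and the outputs in the comments are illustrative) that falls back to a default tag and then queries a single word with no context at all:

>>> from nltk import DefaultTagger as dt
# Back off to 'NN' for words never seen in the training data.
>>> uni_tag_nn = ut(brown_sents[:train], backoff=dt('NN'))
>>> uni_tag_nn.tag("this is a foo bar sentence .".split())
# 'foo' now comes back as ('foo', 'NN') instead of ('foo', None)
# A single word with no context gets its most frequent Brown tag:
>>> uni_tag_nn.tag(['says'])
# e.g. [('says', 'VBZ')]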

I wouldn't recommend using WordNet for POS tagging because there are just so many words that still have no entry in WordNet. But you can take a look at using lemma frequencies in WordNet; see How to get the wordnet sense frequency of a synset in NLTK?. These frequencies are based on the SemCor corpus (http://www.cse.unt.edu/~rada/downloads.html).
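
If you do want a context-free guess from WordNet alone, those SemCor-based counts are exposed as Lemma.count() in NLTK. A minimal sketch (most_likely_wn_pos is a hypothetical helper; the counts are zero for many words, so this only discriminates well for words that actually occur in SemCor):

>>> from nltk.corpus import wordnet as wn
>>> from collections import Counter
>>> def most_likely_wn_pos(word):
...     # Sum SemCor-based sense counts for each WordNet POS of the word.
...     counts = Counter()
...     for lemma in wn.lemmas(word):
...         counts[lemma.synset().pos()] += lemma.count()
...     # Return the most frequent POS, or None if the word is not in WordNet.
...     return counts.most_common(1)[0][0] if counts else None
...
>>> most_likely_wn_pos('say')
# e.g. 'v' (the verb senses of 'say' dominate in SemCor)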

