What is the tagset for the NLTK perceptron tagger?


Question

What is the tagset for the NLTK perceptron tagger? And what corpus was used to train the pre-trained model?

I tried to find official information on the NLTK website, but could not find it there.

Recommended answer

From https://github.com/nltk/nltk/pull/1143, we see that it is a port of the tagger from https://spacy.io/blog/part-of-speech-pos-tagger-in-python

The tagset in the trained tagdict includes the following tags:

>>> from nltk.tag import PerceptronTagger
>>> tagger = PerceptronTagger()
>>> set(tagger.tagdict.values())
set(['PRP$', 'VBG', 'VBD', '``', 'VBN', "''", 'VBP', 'WDT', 'JJ', 'WP', 'VBZ', 'DT', '#', '$', 'NN', ')', '(', ',', '.', 'TO', 'PRP', 'RB', ':', 'NNS', 'NNP', 'VB', 'WRB', 'CC', 'CD', 'EX', 'IN', 'WP$', 'MD', 'JJS', 'JJR'])

The full tagset is:

>>> sorted(tagger.classes)
['#', '$', "''", '(', ')', ',', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``']
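Comparing the two outputs shows which tags are in `tagger.classes` but never appear among the `tagdict` values. The following is a minimal sketch, assuming the tag lists are copied by hand from the outputs above rather than read from a live model; the variable names are illustrative only.

```python
# Tags copied from the set(tagger.tagdict.values()) output above.
tagdict_tags = {
    'PRP$', 'VBG', 'VBD', '``', 'VBN', "''", 'VBP', 'WDT', 'JJ', 'WP',
    'VBZ', 'DT', '#', '$', 'NN', ')', '(', ',', '.', 'TO', 'PRP', 'RB',
    ':', 'NNS', 'NNP', 'VB', 'WRB', 'CC', 'CD', 'EX', 'IN', 'WP$', 'MD',
    'JJS', 'JJR',
}

# Tags copied from the sorted(tagger.classes) output above.
all_classes = [
    '#', '$', "''", '(', ')', ',', '.', ':', 'CC', 'CD', 'DT', 'EX', 'FW',
    'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT',
    'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB',
    'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', '``',
]

# Tags the tagger can predict that do not occur as tagdict values.
only_in_classes = sorted(set(all_classes) - tagdict_tags)
print(only_in_classes)
# ['FW', 'LS', 'NNPS', 'PDT', 'POS', 'RBR', 'RBS', 'RP', 'SYM', 'UH']
```

With a live model, the same comparison could be run on `set(tagger.classes)` and `set(tagger.tagdict.values())` directly; the exact result may differ between NLTK versions.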

It is the Penn Treebank tagset, listed at: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
