Custom tagging with nltk


Question

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases those verbs should apply to. I'm working with nltk but not getting the results I'd hoped for, e.g.:

>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]

In each case it has failed to realise that the first word (select, move or copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution where non-English languages could be handled as well.

So anyway, my question is one of: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form? Is there a way to train a tagger? Is there a better way altogether?

Solution

One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:

>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

Then you get

>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
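
Note that `nltk.tag._POS_TAGGER` is a private name and appears to be absent from newer NLTK releases, so the snippet above may not run as written there. As a version-independent sketch of the same idea (not the method from the answer itself), you can post-process the output of any tagger with a dictionary of overrides. The names `tag_with_overrides`, `noun_only_tagger` and `COMMAND_VERBS` are made up for illustration, and `noun_only_tagger` is only a stand-in so the example runs without NLTK data:

```python
from typing import Callable, Dict, List, Sequence, Tuple

Tagged = List[Tuple[str, str]]


def tag_with_overrides(
    tokens: Sequence[str],
    base_tagger: Callable[[List[str]], Tagged],
    overrides: Dict[str, str],
) -> Tagged:
    """Tag tokens with base_tagger, then replace the tag of any word
    (compared in lower case) that appears in the overrides dict."""
    tagged = base_tagger(list(tokens))
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]


def noun_only_tagger(tokens: List[str]) -> Tagged:
    # Hypothetical stand-in for a real tagger: labels every token 'NN'.
    return [(token, "NN") for token in tokens]


# Hypothetical override table for the command words in the question.
COMMAND_VERBS = {"select": "VB", "move": "VB", "copy": "VB"}

print(tag_with_overrides(["select", "the", "files"], noun_only_tagger, COMMAND_VERBS))
# [('select', 'VB'), ('the', 'NN'), ('files', 'NN')]
```

With NLTK and its tagger data installed, passing `nltk.pos_tag` as `base_tagger` should give the usual tags for every word outside the override table. Like the `model` dictionary above, the lookup ignores context, so a noun use such as "make a copy" would also be tagged VB.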

This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
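
To make "training a tagger" concrete in the simplest (unigram) case, here is a from-scratch sketch: count how often each word carries each tag in a tagged corpus and keep the most frequent tag. This is not how nltk-trainer is implemented; the function name `train_unigram_model` and the three-sentence corpus are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def train_unigram_model(tagged_sentences: Iterable[TaggedSentence]) -> Dict[str, str]:
    """Return a word -> most frequent tag mapping learned from the corpus."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep only the single most common tag seen for each word.
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


# Made-up training data in the style of the commands from the question.
TRAINING = [
    [("select", "VB"), ("the", "DT"), ("files", "NNS")],
    [("move", "VB"), ("the", "DT"), ("files", "NNS"), ("to", "TO"), ("harddrive", "NN")],
    [("copy", "VB"), ("the", "DT"), ("files", "NNS")],
]

model = train_unigram_model(TRAINING)
print(model["select"], model["files"])  # VB NNS
```

The resulting dictionary has the same shape as the `model` passed to `UnigramTagger` above, so it should be usable in its place. NLTK's `UnigramTagger` also accepts a `train=` argument with tagged sentences in this format, which is likely the more practical route for a real corpus.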
