How to override the POS tags assigned to a text by nltk's pos_tag?

Question

I am using pos_tag from nltk to tag text in a set of (untagged) technical documents and am getting good results, but it always tags words like "authenticated" as a verb, even though they are sometimes used as adjectives. In other words, simply replacing the tag everywhere would not be correct every time.

Is there a good way to override or correct the tagging results that takes context into account?

Recommended answer

Unfortunately your question boils down to "how can I improve my tagging?". The answer is, you need to build a better tagger. All non-trivial taggers take context into account, so it's not just a question of adding context sensitivity; it's already there, it's just failing in some cases.

The NLTK tagging model allows you to "chain" taggers, so that each one can take up where the other left off (e.g., the ngram tagger falls back on a regexp tagger for unknown words). It works like this:

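import nltk

# traindata is a list of tagged sentences; its format is described below
t0 = nltk.DefaultTagger('N')                     # last resort: tag everything 'N'
t1 = nltk.UnigramTagger(traindata, backoff=t0)   # per-word model
t2 = nltk.BigramTagger(traindata, backoff=t1)    # also uses the previous tag as context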

traindata here is a list of already tagged sentences in the standard NLTK form: each sentence is a list of tuples in the form (word, tag). (You could use a different training corpus for each tagger, if you have reason to; you'll definitely want to use a consistent tagset.) For example, here's a two-sentence training corpus:

traindata = [ [ ('His', 'PRO'), ('petition', 'N'), ('charged', 'VD'), 
                ('mental', 'ADJ'), ('cruelty', 'N'), ('.', '.') ],
              [ ('Two', 'NUM'), ('tax', 'N'), ('revision', 'N'), ('bills', 'N'),
                ('were', 'V'), ('passed', 'VN'), ('.', '.') ] ]

Tagger t2 (the one you'll use) will build a bigram model; if it sees unknown input, it will fall back on t1, which uses a unigram model; if that fails too, it will defer to t0 (which just tags everything 'N').
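
To make the fallback order concrete, here is a minimal end-to-end sketch that trains the chain on the two-sentence corpus above and tags a short token list. The test sentence is made up for illustration. With a corpus this small, nearly every decision ends up being made by t1 or t0.

```python
import nltk

# The two-sentence training corpus from the answer.
traindata = [
    [('His', 'PRO'), ('petition', 'N'), ('charged', 'VD'),
     ('mental', 'ADJ'), ('cruelty', 'N'), ('.', '.')],
    [('Two', 'NUM'), ('tax', 'N'), ('revision', 'N'), ('bills', 'N'),
     ('were', 'V'), ('passed', 'VN'), ('.', '.')],
]

t0 = nltk.DefaultTagger('N')                     # tags every token 'N'
t1 = nltk.UnigramTagger(traindata, backoff=t0)   # falls back to t0
t2 = nltk.BigramTagger(traindata, backoff=t1)    # falls back to t1

# Made-up input: 'authenticated' never occurs in traindata,
# so no model knows it and it ends up with t0's default tag.
tagged = t2.tag(['Two', 'petition', 'were', 'authenticated', '.'])
print(tagged)
```

The unknown word "authenticated" falls all the way through to t0, which is the kind of case where a better backoff (or more training data) would matter.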

You could add a special-purpose retagger to improve the default tagging, but of course you must first figure out what to have it do, which is what you asked in the first place.
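
As a rough illustration of what such a retagger could look like, the sketch below post-processes a list of (word, tag) pairs, which is the shape of pos_tag's output. Everything in it is an assumption made for the example: the function name, the word list, the Penn Treebank tag names, and the rule itself (a listed participle between a determiner-like tag and a noun becomes an adjective). It is not part of NLTK, and a real rule set would have to come from studying your own errors.

```python
def retag_participles(tagged, words=frozenset({'authenticated'})):
    """Return a copy of `tagged` with hypothetical context rules applied.

    A listed word tagged VBD/VBN is retagged JJ when the previous tag is
    DT, JJ or PRP$ and the next tag is some kind of noun (NN*).
    """
    result = list(tagged)
    # First and last tokens are skipped: the rule needs both neighbours.
    for i in range(1, len(result) - 1):
        word, tag = result[i]
        prev_tag = result[i - 1][1]
        next_tag = result[i + 1][1]
        if (word.lower() in words
                and tag in ('VBD', 'VBN')
                and prev_tag in ('DT', 'JJ', 'PRP$')
                and next_tag.startswith('NN')):
            result[i] = (word, 'JJ')
    return result


# Hand-written inputs standing in for pos_tag output.
adjective_use = [('An', 'DT'), ('authenticated', 'VBN'), ('user', 'NN'),
                 ('logged', 'VBD'), ('in', 'RP')]
verb_use = [('The', 'DT'), ('server', 'NN'), ('authenticated', 'VBD'),
            ('the', 'DT'), ('user', 'NN')]

print(retag_participles(adjective_use))
print(retag_participles(verb_use))
```

Because the rule looks at neighbouring tags rather than the word alone, the verb use is left untouched while the adjective use is corrected.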

If the nltk tagger keeps making the same kinds of mistakes over and over, you can put together a corpus of corrections and train a re-tagger based on that. How much data you need will depend on how consistent the errors are. I've never tried this but the Brill tagger works by successively applying retagging rules, so perhaps it's the right tool to use.
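
A minimal sketch of the Brill idea, under several assumptions: the baseline here is a small stand-in UnigramTagger that always tags "authenticated" as a verb (in practice you would wrap whatever tagger you actually use), the corrected corpus is four invented sentences, and the two templates (previous tag, next tag) are chosen only to keep the example small. It shows the mechanics, not a tuned setup.

```python
import nltk
from nltk.tag.brill import Pos
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tbl.template import Template

# Stand-in baseline that consistently makes the mistake:
# 'authenticated' is always VBD; unlisted words default to NN.
baseline = nltk.UnigramTagger(
    model={'the': 'DT', 'an': 'DT', 'every': 'DT',
           'authenticated': 'VBD', 'left': 'VBD',
           'connected': 'VBD', 'expired': 'VBD'},
    backoff=nltk.DefaultTagger('NN'))

# Invented corpus of corrections (the tags we actually want).
corrected = [
    [('the', 'DT'), ('authenticated', 'JJ'), ('user', 'NN'), ('left', 'VBD')],
    [('an', 'DT'), ('authenticated', 'JJ'), ('client', 'NN'), ('connected', 'VBD')],
    [('the', 'DT'), ('server', 'NN'), ('authenticated', 'VBD'),
     ('the', 'DT'), ('client', 'NN')],
    [('every', 'DT'), ('authenticated', 'JJ'), ('session', 'NN'), ('expired', 'VBD')],
]

# Rule templates: condition on the previous tag, or on the next tag.
templates = [Template(Pos([-1])), Template(Pos([1]))]

trainer = BrillTaggerTrainer(baseline, templates, trace=0)
retagger = trainer.train(corrected, max_rules=10, min_score=2)

for rule in retagger.rules():
    print(rule)  # the learned retagging rule(s)

as_adjective = retagger.tag(['the', 'authenticated', 'session', 'left'])
as_verb = retagger.tag(['the', 'server', 'authenticated', 'every', 'user'])
print(as_adjective)
print(as_verb)
```

The trainer only keeps rules that fix more baseline errors than they introduce, so how many corrected sentences you need depends on how regular the mistakes are.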

The alternative would be to try building your own domain-specific tagged corpus: Tag a training set with the nltk tagger, correct it manually or semi-automatically, then train a tagger on it and try to get better performance on new data than with the default nltk tagger (perhaps by chaining the two taggers together).
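
One possible way to chain a domain-trained tagger with a general-purpose one is sketched below. The helper names are hypothetical, and the approach is an assumption: the domain tagger is trained without a final default, so it returns None for anything it has not seen, and those gaps are filled from a second tagger passed in as a callable (nltk.pos_tag by default, which needs its model data installed). The demo call uses a trivial stand-in so that it runs without any downloads.

```python
import nltk


def train_domain_tagger(corrected_sents):
    """Train a bigram tagger on hand-corrected, domain-specific sentences."""
    # No DefaultTagger at the end of the chain: unseen words get tag None.
    unigram = nltk.UnigramTagger(corrected_sents)
    return nltk.BigramTagger(corrected_sents, backoff=unigram)


def tag_domain_first(tokens, domain_tagger, general_tag=nltk.pos_tag):
    """Prefer the domain tagger; fill its None tags from general_tag."""
    domain = domain_tagger.tag(tokens)
    general = general_tag(tokens)
    return [(word, d_tag if d_tag is not None else g_tag)
            for (word, d_tag), (_, g_tag) in zip(domain, general)]


# Tiny invented corrected corpus; a real one would be far larger.
corrected_sents = [[('an', 'DT'), ('authenticated', 'JJ'), ('user', 'NN')]]
domain_tagger = train_domain_tagger(corrected_sents)


def stand_in_general(tokens):
    """Placeholder for nltk.pos_tag so the demo needs no model data."""
    return [(tok, 'NN') for tok in tokens]


combined = tag_domain_first(['an', 'authenticated', 'request'],
                            domain_tagger, general_tag=stand_in_general)
print(combined)
```

Both taggers must use the same tagset for the merged output to make sense, which is why the invented corpus uses Penn Treebank style tags like those pos_tag produces.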
