Training Tagger with Custom Tags in NLTK


Question

I have a document with tagged data in the format Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model on a set of tagged documents of this type, and then use that model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.

Recommended answer

As @AleksandarSavkov wrote already, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, Developing and evaluating chunkers. It includes code samples you can use verbatim to create a chunker (the ConsecutiveNPChunkTagger). Your responsibility is to select features that will give you good performance.
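To make the feature-selection step concrete, here is a minimal sketch of a feature extractor with the same (sentence, i, history) signature as the book's npchunk_features, so it could be passed to a classifier-based chunker such as the ConsecutiveNPChunkTagger. The function name chunk_features and the particular features are illustrative assumptions, not something the answer prescribes; treat them as a starting point to tune against your own data.

```python
def chunk_features(sentence, i, history):
    """Build a feature dict for token i.

    sentence: list of (word, pos) pairs for one sentence.
    history:  IOB tags already predicted for the tokens before position i.
    All feature names below are illustrative choices (assumptions).
    """
    word, pos = sentence[i]
    # Neighbouring POS tags, with sentinel values at the sentence edges.
    prev_pos = sentence[i - 1][1] if i > 0 else "<START>"
    next_pos = sentence[i + 1][1] if i < len(sentence) - 1 else "<END>"
    # The previous chunk tag helps the classifier continue a B-/I- sequence.
    prev_iob = history[-1] if history else "<START>"
    return {
        "word.lower": word.lower(),
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
        "prev_iob": prev_iob,
        "is_digit": word.isdigit(),    # may help with tags like PHONE
        "is_title": word.istitle(),    # may help with tags like CITY
        "length": len(word),
    }
```

Features such as is_digit and is_title are guesses aimed at the PHONE and CITY tags in the question; whether they help is something only evaluation on held-out data can tell you.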

You'll need to transform your data into the IOB format expected by NLTK's architecture. It expects part-of-speech tags, so the first step should be to run your input through a POS tagger; nltk.pos_tag() will do a decent enough job (once you strip off markup like [KEYWORD ...]), and requires no additional software to be installed. When your corpus is in the following format (word, POS tag, IOB tag), you are ready to train a recognizer:

Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O
...
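The conversion from bracket markup to rows like these can be sketched as below. This is only one possible approach under stated assumptions: tag names are taken to be uppercase letters and underscores, brackets are assumed not to nest, and simple_tokenize is a crude stand-in tokenizer (it splits here's differently from the sample above). The helper names markup_to_iob and to_rows are hypothetical. The POS tagger is passed in as a function, so you can supply nltk.pos_tag, which accepts a list of tokens.

```python
import re

# A bracketed span such as "[CITY New York]", or a run of text outside brackets.
# Assumption: tag names use only A-Z and "_", and brackets do not nest.
MARKUP = re.compile(r"\[([A-Z_]+) ([^\]]+)\]|([^\[]+)")
TOKEN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Crude stand-in tokenizer: runs of word characters or single punctuation marks."""
    return TOKEN.findall(text)


def markup_to_iob(text, tokenize=simple_tokenize):
    """Strip the markup and return (word, iob_tag) pairs for one string."""
    pairs = []
    for match in MARKUP.finditer(text):
        label, inside, outside = match.groups()
        if label is not None:
            # First token of a span gets B-, the rest get I-.
            for k, word in enumerate(tokenize(inside)):
                prefix = "B" if k == 0 else "I"
                pairs.append((word, f"{prefix}-{label}"))
        else:
            pairs.extend((word, "O") for word in tokenize(outside))
    return pairs


def to_rows(pairs, pos_tagger):
    """Attach POS tags; pos_tagger maps a list of words to (word, pos) pairs."""
    words = [word for word, _ in pairs]
    tagged = pos_tagger(words)
    return [(word, pos, iob) for (word, pos), (_, iob) in zip(tagged, pairs)]
```

With NLTK and its tagger data installed, to_rows(markup_to_iob(text), nltk.pos_tag) would give (word, POS tag, IOB tag) triples that you can write out one per line.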
