Training a Tagger with Custom Tags in NLTK
Question
I have a document with tagged data in the following format:
Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]
I want to train a model on a set of tagged documents of this type and then use that model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.
Recommended answer
As @AleksandarSavkov already wrote, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, Developing and evaluating chunkers. It includes code samples you can use verbatim to create a chunker (the ConsecutiveNPChunkTagger). Your responsibility is to select features that will give you good performance.
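To make "selecting features" concrete, here is a minimal sketch of a feature function in the shape the book's chunker code calls it: it receives the POS-tagged sentence, the index of the current token, and the IOB tags predicted so far. The name chunk_features and the particular features (digit check, capitalization, previous tag) are assumptions chosen for tags like PHONE and CITY, not something the book or the question prescribes.

```python
def chunk_features(sentence, i, history):
    """Return a feature dict for token i of a POS-tagged sentence.

    sentence: list of (word, pos) pairs
    history:  IOB tags already assigned to tokens 0..i-1
    """
    word, pos = sentence[i]
    # Context from the previous token, with a sentinel at sentence start.
    prev_pos = sentence[i - 1][1] if i > 0 else "<START>"
    prev_tag = history[i - 1] if i > 0 else "<START>"
    return {
        "pos": pos,
        "prev_pos": prev_pos,
        "prev_tag": prev_tag,
        "word_lower": word.lower(),
        # Hypothetical cues: all-digit tokens hint at PHONE,
        # capitalized tokens hint at CITY.
        "is_digits": word.isdigit(),
        "is_capitalized": word[:1].isupper(),
    }
```

A function like this would be swapped in wherever the book's sample code calls its own feature extractor; which features actually help is something to check against held-out data.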
You'll need to transform your data into the IOB format expected by NLTK's architecture. That format includes part-of-speech tags, so the first step should be to run your input through a POS tagger; nltk.pos_tag() will do a decent enough job (once you strip off markup like [KEYWORD ...]) and requires no additional software to be installed. When your corpus is in the following format (word -- POS-tag -- IOB-tag), you are ready to train a recognizer:
Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O
...
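One way to get from the bracketed markup to the word and IOB columns above is sketched here. It is a rough illustration under stated assumptions: the function name markup_to_iob is hypothetical, labels are assumed to be uppercase letters and underscores, and tokens are split on whitespace only, so "here's" stays a single token rather than being split as in the listing. The POS column is not produced here; it would come from passing the resulting words to nltk.pos_tag().

```python
import re

# Matches either a bracketed span such as "[CITY New York]" or a plain
# token outside any brackets (assumes labels use only A-Z and "_").
TOKEN_RE = re.compile(r"\[([A-Z_]+) ([^\]]+)\]|([^\s\[\]]+)")

def markup_to_iob(text):
    """Return a list of (word, iob_tag) pairs for one marked-up string."""
    pairs = []
    for match in TOKEN_RE.finditer(text):
        label, span, plain = match.groups()
        if label:
            words = span.split()
            # First word of a span opens the chunk, the rest continue it.
            pairs.append((words[0], "B-" + label))
            pairs.extend((w, "I-" + label) for w in words[1:])
        else:
            pairs.append((plain, "O"))
    return pairs

for word, tag in markup_to_iob("I live in a [PROP_TYPE condo] in [CITY New York]"):
    print(word, tag)
```

Keeping the markup stripping and the IOB labelling in one pass means the words handed to the POS tagger line up one-to-one with the IOB tags.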