Training Tagger with Custom Tags in NLTK

Question

I have documents with tagged data in the following format:

Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York].

I want to train a model on a set of documents tagged this way, and then use that model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.

Recommended answer

As @AleksandarSavkov already wrote, this is essentially a named entity recognition (NER) task, or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, "Developing and Evaluating Chunkers". It includes code samples you can use verbatim to create a chunker (the ConsecutiveNPChunkTagger). Your responsibility is to select features that will give you good performance.
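To give a feel for that pattern, here is a minimal sketch of a consecutive (left-to-right) chunk tagger that works with arbitrary tag names. It is adapted from the idea in the book, not copied from it: the class name ConsecutiveChunkTagger, the chunk_features function, and the toy training sentences are all made up for illustration, and I have swapped in nltk.NaiveBayesClassifier because it needs no external software (as far as I recall, the book's version trains a maximum-entropy classifier instead). Training sentences are assumed to be lists of ((word, POS), IOB-tag) pairs.

```python
import nltk


def chunk_features(sentence, i, history):
    """Features for token i (illustrative choices, not a tuned set).

    `sentence` is a list of (word, pos) pairs; `history` holds the IOB
    tags already assigned to tokens 0..i-1.
    """
    word, pos = sentence[i]
    return {
        "pos": pos,
        "word": word.lower(),
        "is_digit": word.isdigit(),
        "prev_pos": sentence[i - 1][1] if i > 0 else "<START>",
        "prev_iob": history[i - 1] if i > 0 else "<START>",
    }


class ConsecutiveChunkTagger(nltk.TaggerI):
    """Assigns IOB tags one token at a time, feeding earlier decisions back in."""

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            # Drop the IOB tags to get the plain (word, pos) sequence.
            untagged = nltk.tag.untag(tagged_sent)
            history = []
            for i, (_, iob) in enumerate(tagged_sent):
                train_set.append((chunk_features(untagged, i, history), iob))
                history.append(iob)
        # Swapped-in classifier: Naive Bayes ships with NLTK.
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i in range(len(sentence)):
            features = chunk_features(sentence, i, history)
            history.append(self.classifier.classify(features))
        return list(zip(sentence, history))


# Toy training data, invented for this sketch; a real corpus needs far more.
train_sents = [
    [(("my", "PRP$"), "O"), (("phone", "NN"), "B-KEYWORD"),
     (("number", "NN"), "I-KEYWORD"), (("is", "VBZ"), "O"),
     (("7802708523", "CD"), "B-PHONE")],
    [(("I", "PRP"), "O"), (("live", "VBP"), "O"), (("in", "IN"), "O"),
     (("New", "NNP"), "B-CITY"), (("York", "NNP"), "I-CITY")],
]

tagger = ConsecutiveChunkTagger(train_sents)
for (word, pos), iob in tagger.tag([("my", "PRP$"), ("phone", "NN"), ("number", "NN")]):
    print(word, pos, iob)
```

Nothing here depends on the label being NP: the classifier simply learns whatever IOB strings appear in the training data, so custom tags like KEYWORD or PHONE need no special handling. The feature function is where your effort should go.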

You'll need to transform your data into the IOB format expected by NLTK's architecture. It expects part-of-speech tags, so the first step should be to run your input through a POS tagger; nltk.pos_tag() will do a decent enough job (once you strip off markup like [KEYWORD ...]), and requires no additional software to be installed. When your corpus is in the following format (word -- POS-tag -- IOB-tag), you are ready to train a recognizer:

Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O
...
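The conversion from the bracketed markup to rows like these could be sketched roughly as follows. This is only an illustration under some assumptions: labels are upper-case names such as PROP_TYPE, bracketed spans are never nested, and the regex tokenizer is a simple stand-in of my own, not NLTK's. The names markup_to_iob, SPAN_RE, and TOKEN_RE are hypothetical. The POS tagger is passed in as a callable with the same shape as nltk.pos_tag (a token list in, (word, tag) pairs out); without one, a placeholder POS tag is used.

```python
import re

# A bracketed span such as "[PHONE 7802708523]", or a run of plain text up to
# the next "[". Assumes upper-case labels and no nesting.
SPAN_RE = re.compile(r"\[([A-Z][A-Z0-9_]*)\s+([^\]]+)\]|([^\[]+)")

# Deliberately simple tokenizer (an assumption): clitics like "'s", words,
# or single punctuation characters.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def markup_to_iob(text, pos_tagger=None):
    """Return (word, POS, IOB) triples for bracket-annotated text.

    `pos_tagger` takes a list of tokens and returns (word, tag) pairs,
    e.g. nltk.pos_tag. If omitted, the POS column is filled with "UNK".
    """
    words, iob = [], []
    for match in SPAN_RE.finditer(text):
        label, inside, outside = match.groups()
        if label is not None:
            tokens = TOKEN_RE.findall(inside)
            # First token of a span gets B-, the rest get I-.
            tags = [("B-" if i == 0 else "I-") + label for i in range(len(tokens))]
        else:
            tokens = TOKEN_RE.findall(outside)
            tags = ["O"] * len(tokens)
        words.extend(tokens)
        iob.extend(tags)

    # Tag the markup-free token sequence in one pass so the tagger sees
    # the whole sentence as context.
    if pos_tagger is None:
        pos = ["UNK"] * len(words)
    else:
        pos = [tag for _, tag in pos_tagger(words)]
    return list(zip(words, pos, iob))


sample = ("Hi here's my [KEYWORD phone number], let me know when you "
          "wanna hangout: [PHONE 7802708523].")
for word, pos, tag in markup_to_iob(sample):
    print(word, pos, tag)
```

With NLTK and its tagger model installed, calling markup_to_iob(sample, pos_tagger=nltk.pos_tag) should fill in real POS tags. For training you would also want to split the text into sentences first, since chunkers are normally trained one sentence at a time.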
