Train corpus for NER with NLTK ieer or conll2000 corpus


Question

I have been trying to train a Named Entity Recognition model for a specific domain, with new entity types. There does not seem to be a complete, suitable pipeline for this, so different packages have to be combined.

I would like to give NLTK a chance. My question is: how can I train the NLTK NER to classify and match new entities using the ieer corpus?

I will of course provide training data in IOB format, like:

We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP

I guess I will have to tag the tokens myself.

What do I do next once I have a text file in this format? What are the steps to train on my data with the ieer corpus, or with a better one such as conll2000?

I know there is some documentation out there, but it is not clear to me what to do once the training corpus is tagged.

I want to go with NLTK because I then want to use the relextract() function.

Any suggestions would be appreciated.

Thanks

Recommended answer

NLTK provides everything you need. Read Chapter 6 of the NLTK book, "Learning to Classify Text", which gives a worked example of classification. Then study sections 2 and 3 of Chapter 7, which show how to work with IOB text and write a chunking classifier. Although the example application there is not named entity recognition, the code examples should need almost no changes to work (though of course you'll need a custom feature function to get decent performance).
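As a rough illustration of that approach, here is a minimal sketch, not the book's exact code. It assumes the three-column "word POS IOB" layout from the question, with blank lines between sentences. The entity labels (PER, LOC), the sample sentences, and the helper names `read_iob`, `token_features`, `build_training_set` and `tag_sentence` are all made up for this example. The classifier step uses `nltk.NaiveBayesClassifier` and is skipped if NLTK is not installed; the book's own example uses a different classifier.

```python
# Toy data in the "word POS IOB" layout; labels PER/LOC are hypothetical.
SAMPLE = """\
Alice NNP B-PER
visited VBD O
Paris NNP B-LOC
. . O

Bob NNP B-PER
Smith NNP I-PER
left VBD O
. . O
"""


def read_iob(text):
    """Split IOB text into sentences of (word, pos, iob) triples.

    Blank lines are treated as sentence boundaries.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences


def token_features(sentence, i, history):
    """Features for token i.

    `sentence` is a list of (word, pos) pairs; `history` holds the IOB
    tags already assigned to tokens 0..i-1 of the same sentence.
    """
    word, pos = sentence[i]
    return {
        "word.lower": word.lower(),
        "pos": pos,
        "is_title": word.istitle(),
        "prev_pos": sentence[i - 1][1] if i > 0 else "<START>",
        "prev_iob": history[i - 1] if i > 0 else "<START>",
    }


def build_training_set(sentences):
    """Turn tagged sentences into (featureset, label) pairs."""
    train = []
    for sent in sentences:
        untagged = [(w, p) for w, p, _ in sent]
        history = []
        for i, (_, _, iob) in enumerate(sent):
            train.append((token_features(untagged, i, history), iob))
            history.append(iob)  # gold tag becomes context for the next token
    return train


def tag_sentence(classifier, sentence):
    """Greedily tag a POS-tagged sentence left to right."""
    history = []
    for i in range(len(sentence)):
        history.append(classifier.classify(token_features(sentence, i, history)))
    return [(w, p, t) for (w, p), t in zip(sentence, history)]


train_set = build_training_set(read_iob(SAMPLE))

try:
    import nltk
except ImportError:
    nltk = None  # the parsing and feature code above does not need NLTK

if nltk is not None:
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(tag_sentence(classifier, [("Alice", "NNP"), ("left", "VBD")]))
```

With only two training sentences the predictions are meaningless; the point is the shape of the pipeline. The features in `token_features` are placeholders, and this is where domain-specific knowledge (gazetteers, word shape, suffixes) would go. If chunk trees are needed afterwards, `nltk.chunk.conlltags2tree` can convert a list of (word, pos, iob) triples into a tree, given the entity labels as `chunk_types`.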

You can also use NLTK's tagger (or another tagger) to add POS tags to your corpus, or you could take your chances and train a classifier on data without part-of-speech tags (just the IOB named entity categories). My guess is that POS tagging will improve performance, and you are much better off if the same POS tagger is used on the training data as for evaluation (and eventually production use).
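One way to keep the tagger consistent is to route every dataset through the same function. The sketch below assumes the corpus starts out as (word, iob) pairs without POS tags; `add_pos_tags` and `toy_tagger` are hypothetical names, and the toy tagger only stands in for a real one so that the example runs without any model downloads.

```python
def add_pos_tags(sentences, pos_tagger):
    """Add POS tags to sentences of (word, iob) pairs.

    `pos_tagger` is any callable that maps a list of words to a list of
    (word, pos) pairs. Returns sentences of (word, pos, iob) triples.
    """
    tagged = []
    for sent in sentences:
        words = [word for word, _ in sent]
        pos_tags = [pos for _, pos in pos_tagger(words)]
        tagged.append(
            [(word, pos, iob) for (word, iob), pos in zip(sent, pos_tags)]
        )
    return tagged


def toy_tagger(words):
    """Stand-in tagger for illustration: capitalised words are NNP, others NN."""
    return [(w, "NNP" if w[0].isupper() else "NN") for w in words]


corpus = [[("Alice", "B-PER"), ("sleeps", "O")]]
with_pos = add_pos_tags(corpus, toy_tagger)
print(with_pos)
```

In practice `nltk.pos_tag` could be passed as `pos_tagger`, since it takes a list of tokens and returns (token, tag) pairs, but it needs its tagger model downloaded first. Whatever tagger is chosen, the same callable should be applied to the training, evaluation and production data.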
