How do I train a Named Entity Recognizer in OpenNLP?

Question

OK, I have the following code to train the NER identifier from OpenNLP:

// Read train.txt line by line; NameSampleDataStream parses each line as one annotated sample.
FileReader fileReader = new FileReader("train.txt");
ObjectStream<String> fileStream = new PlainTextByLineStream(fileReader);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(fileStream);
// Train the model and wrap it in a name finder.
TokenNameFinderModel model = NameFinderME.train("pt-br", "train", sampleStream, Collections.<String, Object>emptyMap());
nfm = new NameFinderME(model);

I don't know if I'm doing something wrong or if something is missing, but the classification is not working. I suspect that train.txt is wrong.

The error is that all tokens are classified as a single type.

My train.txt data looks like the following example, but with much more variation and many more entries. Another thing is that I'm classifying the text one word at a time, not all tokens together.

<START:distance> 8000m <END>
<START:temperature> 100ºC <END>
<START:weight> 50kg <END>
<START:name> Renato <END>

Can somebody show me what I'm doing wrong?

Recommended answer

Your training data is not OK.

You should put every entity in context inside a sentence:

At an altitude of <START:distance> 8000m <END> the temperature of boiling water is less than <START:temperature> 100ºC <END> .
The climber <START:name> Renato <END> is carrying <START:weight> 50kg <END> of equipment.
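
To make the difference concrete, here is a minimal sketch that counts how many tokens on a training line fall outside any <START:type> ... <END> span. The class TrainingLineCheck and its methods are hypothetical helpers written for this illustration, not part of OpenNLP, and the sketch assumes tags and tokens are separated by whitespace as in the examples above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class): inspects one line of annotated training data. */
final class TrainingLineCheck {

    /** Counts tokens that lie outside every <START:type> ... <END> span. */
    static int contextTokenCount(String line) {
        int count = 0;
        boolean insideEntity = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            if (token.startsWith("<START") && token.endsWith(">")) {
                insideEntity = true;
            } else if (token.equals("<END>")) {
                insideEntity = false;
            } else if (!insideEntity) {
                count++; // an ordinary word surrounding the entities
            }
        }
        return count;
    }

    /** Returns the entity types annotated on the line, in order of appearance. */
    static List<String> entityTypes(String line) {
        List<String> types = new ArrayList<>();
        String prefix = "<START:";
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith(prefix) && token.endsWith(">")) {
                types.add(token.substring(prefix.length(), token.length() - 1));
            }
        }
        return types;
    }

    public static void main(String[] args) {
        String isolated = "<START:distance> 8000m <END>";
        String inContext = "The climber <START:name> Renato <END> is carrying "
                + "<START:weight> 50kg <END> of equipment.";
        System.out.println(contextTokenCount(isolated));  // prints 0
        System.out.println(contextTokenCount(inContext)); // prints 6
        System.out.println(entityTypes(inContext));       // prints [name, weight]
    }
}
```

A line from the original train.txt has zero context tokens, so the model sees nothing around the entity that could help it tell one type from another.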

You will get better results if your training data comes from real-world sentences written in the same style as the sentences you are classifying. For example, if you are going to process news, you should train on a newspaper corpus.

You will also need thousands of sentences to build your model! You could start with a hundred or so to bootstrap: use the resulting poor model to help annotate more of your corpus, then train the model again.

And of course you should classify all tokens of a sentence together, otherwise there will be no context to decide the type of an entity.
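
As a rough illustration of why this matters, the sketch below collects the neighbouring tokens around a target word. It is not OpenNLP's feature generator; the class ContextWindow, the method neighbours, and the window size of 2 are assumptions chosen only to show what information is available when a whole sentence is passed in versus a single word.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a simplified view of "context", not OpenNLP's actual features. */
final class ContextWindow {

    /** Returns up to `size` neighbours on each side of tokens[index], excluding the token itself. */
    static List<String> neighbours(String[] tokens, int index, int size) {
        List<String> result = new ArrayList<>();
        int from = Math.max(0, index - size);
        int to = Math.min(tokens.length - 1, index + size);
        for (int i = from; i <= to; i++) {
            if (i != index) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sentence = "The climber Renato is carrying 50kg of equipment .".split(" ");
        // Whole sentence: the word "Renato" comes with surrounding words.
        System.out.println(neighbours(sentence, 2, 2)); // prints [The, climber, is, carrying]
        // One word at a time: nothing surrounds it.
        System.out.println(neighbours(new String[] {"Renato"}, 0, 2)); // prints []
    }
}
```

When each word is submitted on its own, the neighbour list is always empty, so the only evidence left is the word itself.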
