OpenNLP: Training a custom NER Model for multiple entities


Question

I am trying to train a custom NER model for multiple entities. Here is the sample training data:

count all <START:item_type> operating tables <END> on the <START:location_id> third <END> <START:location_type> floor <END>
count all <START:item_type> items <END> on the <START:location_id> third <END> <START:location_type> floor <END>
how many <START:item_type> beds <END> are in <START:location_type> room <END> <START:location_id> 2 <END>

The NameFinderME.train(.) method takes a string parameter, type. What is this parameter used for? And how can I train a model for multiple entities (e.g. item_type, location_type, and location_id in my case)?

public static void main(String[] args) throws Exception {
    String trainingDataFile = "/home/OpenNLPTest/lib/training_data.txt";
    String outputModelFile = "/tmp/model.bin";
    String sentence = "how many beds are in the hospital";

    train(trainingDataFile, outputModelFile, "location_type");
    predict(sentence, outputModelFile);
}

private static void train(String trainingDataFile, String outputModelFile, String tagToFind) {
    File inFile = new File(trainingDataFile);
    NameSampleDataStream nss = null;
    try {
        nss = new NameSampleDataStream(new PlainTextByLineStream(new java.io.FileReader(inFile)));
    } catch (Exception e) {
        // Report the failure instead of silently continuing with a null stream
        e.printStackTrace();
    }

    TokenNameFinderModel model = null;
    int iterations = 100;
    int cutoff = 5;
    try {
        // Does the 'type' parameter mean the entity type that I am trying to train the model for?
        // What if I need to train for multiple entities?
        model = NameFinderME.train("en", tagToFind, nss, (AdaptiveFeatureGenerator) null, Collections.<String,Object>emptyMap(), iterations, cutoff); 
    } catch (Exception e) {
        // Report training failures instead of silently leaving model as null
        e.printStackTrace();
    }

    // try-with-resources closes the output stream even if serialization fails
    try (FileOutputStream outFileStream = new FileOutputStream(new File(outputModelFile))) {
        model.serialize(outFileStream);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private static void predict(String sentence, String modelFile) throws Exception {
    FileInputStream modelInToken = new FileInputStream("/tmp/en-token.bin");
    TokenizerModel modelToken = new TokenizerModel(modelInToken);
    Tokenizer tokenizer = new TokenizerME(modelToken); 
    String tokens[] = tokenizer.tokenize(sentence);

    FileInputStream modelIn = new FileInputStream(modelFile);

    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
    NameFinderME nameFinder = new NameFinderME(model);
    Span nameSpans[] = nameFinder.find(tokens);

    double[] spanProbs = nameFinder.probs(nameSpans);

    for( int i = 0; i<nameSpans.length; i++) {
        System.out.println(nameSpans[i]);
    }

}

Recommended answer

The type argument to NameFinderME.train is used as the default type for training data that does not include a type parameter. This is only relevant if you have a sample that looks like this:

<START> operating tables <END>

instead of like this:

<START:item_type> operating tables <END>
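
To make the fallback rule concrete, here is a minimal sketch that lists the type each annotated span would end up with. It only mimics the rule described above with a regular expression and does not use OpenNLP's own parser; the class TagTypeInspector and the method spanTypes are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: it only illustrates the default-type rule.
public class TagTypeInspector {
    // Matches "<START>" or "<START:some_type>"; group 1 is the type, or null when absent.
    private static final Pattern START_TAG = Pattern.compile("<START(?::([^>]+))?>");

    /** Returns the type of each annotated span, in order, using defaultType for untyped tags. */
    public static List<String> spanTypes(String line, String defaultType) {
        List<String> types = new ArrayList<>();
        Matcher m = START_TAG.matcher(line);
        while (m.find()) {
            String explicit = m.group(1);
            types.add(explicit != null ? explicit : defaultType);
        }
        return types;
    }

    public static void main(String[] args) {
        String typed = "count all <START:item_type> operating tables <END> on the "
                + "<START:location_id> third <END> <START:location_type> floor <END>";
        String untyped = "count all <START> operating tables <END>";

        // Explicit types win, so the default is ignored here.
        System.out.println(spanTypes(typed, "location_type"));
        // No explicit type, so the default is used.
        System.out.println(spanTypes(untyped, "location_type"));
    }
}
```

With the training data from the question, every span already carries an explicit type, so the value passed as type should have no effect on which types the model learns.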

To train multiple types of entities, the developer documentation says:

A training file can contain multiple types. If the training file contains multiple types the created model will also be able to detect these multiple types. For now its recommended to only train single type models, since multi type support is still experimental.

So you could try training on the sample from your question, which includes multiple types, and see how well it works. In this mailing list message, someone asks for the status of training for multiple types and gets this answer:

The code path itself is stable, the reason we put it there is that it didn't have a good performance on the English data.

Anyway, there performance might highly depend on your data set and the language.

If you don't get good performance with a model that handles multiple types, the alternative would be to create multiple copies of your training data, where each copy is modified to include only one type. You would then train a separate model on each set of training data. At that point you should have (for example) an item_type model, a location_type model, and a location_id model. You could then run your input through each model to detect the different types.
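
Producing the single-type copies can be automated. The following is a minimal sketch, assuming every span in the training data is written as "<START:type> tokens <END>" with spaces around the tags, as in the sample from the question. The class SingleTypeFilter and the method keepOnly are hypothetical names, and untyped <START> tags are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: rewrites one training line for a single entity type.
public class SingleTypeFilter {
    // One annotated span; group 1 is the type, group 2 the tokens between the tags.
    private static final Pattern SPAN = Pattern.compile("<START:([^>]+)> (.*?) <END>");

    /** Keeps annotations of typeToKeep and reduces all other spans to their plain tokens. */
    public static String keepOnly(String line, String typeToKeep) {
        Matcher m = SPAN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Keep the whole tagged span for the wanted type, otherwise only its tokens.
            String replacement = m.group(1).equals(typeToKeep) ? m.group() : m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "how many <START:item_type> beds <END> are in "
                + "<START:location_type> room <END> <START:location_id> 2 <END>";
        for (String type : new String[] {"item_type", "location_type", "location_id"}) {
            System.out.println(keepOnly(line, type));
        }
    }
}
```

Applying keepOnly to every line of the training file, once per type, would give one training file per entity type. Each file can then be passed to the training code from the question to build its own model.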
