OpenNLP: foreign names do not get recognized


Question

I just started using OpenNLP to recognize names. I am using the model that ships with OpenNLP (en-ner-person.bin). I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) Are there already models available that I can use to recognize foreign names? (2) If not, then I believe I will need to generate new models. In that case, is there a corpus available that I can use?

Recommended answer

You can build your own model from your own data using an OpenNLP addon called modelbuilder-addon. It is brand new, so if you try it you may be the first person other than me to do so, but it works for me.

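You supply it with the following:

  • A file of known entities, with one name per line
  • A file of sentences from your data, with one sentence per line
  • (Optional) A blacklist file, used to remove false positives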

You can check out the addon here:

https://svn.apache.org/repos/asf/opennlp/addons/modelbuilder-addon

You can use this to get started:

import java.io.File;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;

public class ModelBuilderAddonUse {

  public static void main(String[] args) {
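    // Input: plain text, one sentence per line
    File fileOfSentences = new File("path to your sentence file");
    // Input: known entities, one name per line
    File fileOfNames = new File("path to your file of person names");
    // Input (optional): false positives to exclude
    File blackListFile = new File("path to your blacklist file");
    // Output: the trained model
    File modelOutFile = new File("path where the model will be saved");
    // Output: the annotated sentences (use a path other than the input sentence file)
    File annotatedSentencesOutFile = new File("path where the annotated sentences will be written");

    // "person" is the entity type to train; 3 is the number of iterations
    DefaultModelBuilderUtil.generateModel(fileOfSentences, fileOfNames, blackListFile, modelOutFile, annotatedSentencesOutFile, "person", 3);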
  }
}

The idea is that your known entities (common names in your data) are used to create annotations, those annotations are used to train a model, and the model is then used to find more names and produce more annotations, and so on. The tool repeats this cycle as many times as the "iterations" parameter specifies. Run it, check the results, add any undesirable hits to the blacklist file, and then run the training again. I have used this and got pretty good results. If you find problems with it, file a ticket with OpenNLP.
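To make the first step of that loop concrete, here is a minimal sketch of how known names could be turned into annotations. It is not the addon's actual implementation: the class KnownEntityAnnotator and its annotate method are hypothetical names invented for illustration, and the matching rule (exact string match, longest name first) is an assumption. The only part taken from OpenNLP itself is the name finder training markup, <START:person> ... <END>.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of modelbuilder-addon.
public class KnownEntityAnnotator {

  // Wraps every known name found in the sentence in OpenNLP name finder
  // training markup. Names that appear on the blacklist are left untouched.
  static String annotate(String sentence, List<String> knownNames,
                         Set<String> blacklist, String type) {
    // Drop blacklisted names and try longer names first, so that
    // "Priya Sharma" wins over "Priya" when both are known.
    List<String> quoted = knownNames.stream()
        .filter(name -> !blacklist.contains(name))
        .sorted(Comparator.comparingInt(String::length).reversed())
        .map(Pattern::quote)
        .collect(Collectors.toList());
    if (quoted.isEmpty()) {
      return sentence;
    }
    // Match a name only when it is not embedded in a longer word or number.
    Pattern pattern = Pattern.compile(
        "(?<![\\p{L}\\p{N}])(" + String.join("|", quoted) + ")(?![\\p{L}\\p{N}])");
    String open = Matcher.quoteReplacement("<START:" + type + "> ");
    return pattern.matcher(sentence).replaceAll(open + "$1 <END>");
  }

  public static void main(String[] args) {
    List<String> names = List.of("Priya", "Priya Sharma", "Akira Kurosawa", "Tokyo");
    Set<String> blacklist = Set.of("Tokyo"); // a false positive found on review
    String sentence = "Akira Kurosawa met Priya Sharma in Tokyo .";
    System.out.println(annotate(sentence, names, blacklist, "person"));
    // prints: <START:person> Akira Kurosawa <END> met <START:person> Priya Sharma <END> in Tokyo .
  }
}
```

Sentences annotated this way are what a name finder model would be trained on. In the workflow described above, the trained model then proposes further names, and anything it gets wrong goes into the blacklist before the next run.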
