OpenNLP: foreign names do not get recognized

Question

I just started using OpenNLP to recognize names. I am using the model that ships with OpenNLP (en-ner-person.bin). I noticed that while it recognizes US, UK, and European names, it fails to recognize Indian or Japanese names. My questions are: (1) Are there already models available that I can use to recognize foreign names? (2) If not, I believe I will need to generate new models. In that case, is there a corpus available that I can use?

Recommended answer

You can build your own model from your data using an OpenNLP addon called modelbuilder-addon. It is brand new, so if you try it you may be the first person other than me to do so, but it works for me.

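You give it the following:

  • A file of known entities, where each line is one name
  • A file of sentences taken from your data, where each line is one sentence
  • (Optional) A blacklist file used to remove false positives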
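
As a rough illustration of those inputs, the sketch below writes three small sample files in the one-entry-per-line layout described above. The class name, file names, and sample data are hypothetical, and the one-entry-per-line layout of the blacklist is an assumption (the answer only specifies it for the other two files).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for preparing sample input files; not part of the addon.
class ModelBuilderInputFiles {

  /** Writes the given entries to a UTF-8 text file, one entry per line. */
  static Path writeLines(Path file, List<String> lines) {
    try {
      return Files.write(file, lines, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads a UTF-8 text file back as a list of lines. */
  static List<String> readLines(Path file) {
    try {
      return Files.readAllLines(file, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("modelbuilder-input");

    // Known entities: one name per line.
    writeLines(dir.resolve("knownentities.txt"),
        Arrays.asList("Priya Sharma", "Haruto Tanaka"));

    // Sentences from your own data: one sentence per line.
    writeLines(dir.resolve("sentences.txt"), Arrays.asList(
        "Priya Sharma joined the team in Mumbai last year .",
        "The report was reviewed by Haruto Tanaka in Osaka ."));

    // Optional blacklist of false positives (layout assumed: one per line).
    writeLines(dir.resolve("blacklist.txt"), Arrays.asList("Osaka"));

    System.out.println("Wrote input files to " + dir);
  }
}
```

The resulting paths can then be passed as the sentence, names, and blacklist files.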

You can check out the addon here:

https://svn.apache.org/repos/asf/opennlp/addons/modelbuilder-addon

You can use this to get started:

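import java.io.File;
import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;

public class ModelBuilderAddonUse {

  public static void main(String[] args) {
    // Input: one sentence per line, taken from your own data.
    File fileOfSentences = new File("path to your sentence file");
    // Input: one known person name per line.
    File fileOfNames = new File("path to your file of person names");
    // Input (optional): false positives to exclude.
    File blackListFile = new File("path to your blacklist file");
    // Output: where the trained model will be saved.
    File modelOutFile = new File("path to where the model will be saved");
    // Output: where the annotated sentences will be saved.
    File annotatedSentencesOutFile = new File("path to where the annotated sentences will be saved");

    // "person" is the entity type; the last argument (3) is the number of iterations.
    DefaultModelBuilderUtil.generateModel(fileOfSentences, fileOfNames, blackListFile, modelOutFile, annotatedSentencesOutFile, "person", 3);
  }
}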

The idea is that your known entities (common names in your data) are used to create annotations, those annotations are used to train a model, and the model is then used to find more names and produce more annotations, and so on. The tool repeats this cycle according to the "iterations" parameter. Run it, check your results, add any undesirable hits to the blacklist file, and then run the training again. I've used this and got pretty good results. If you find problems with it, file a ticket with OpenNLP.
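
To make the first step of that cycle concrete, here is a minimal plain-Java sketch of how known names could be turned into annotated training sentences while skipping blacklisted entries. It is an illustration of the idea only, not the addon's actual code: the class and method names are hypothetical, and it assumes whitespace-tokenized sentences with Latin-script names. The `<START:person> ... <END>` markup is the format OpenNLP's name finder trainer reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "known entities -> annotations" step.
class KnownEntityAnnotationSketch {

  /** Wraps every known, non-blacklisted name in the sentence with name-finder tags. */
  static String annotate(String sentence, Collection<String> knownNames,
                         Set<String> blacklist, String entityType) {
    List<String> names = new ArrayList<>();
    for (String name : knownNames) {
      if (!name.trim().isEmpty() && !blacklist.contains(name)) {
        names.add(name);
      }
    }
    if (names.isEmpty()) {
      return sentence;
    }
    // Longest names first, so "Haruto Tanaka" wins over a bare "Tanaka".
    names.sort((a, b) -> Integer.compare(b.length(), a.length()));

    // One alternation, applied in a single pass, so matches are never nested.
    StringBuilder alternation = new StringBuilder();
    for (String name : names) {
      if (alternation.length() > 0) {
        alternation.append('|');
      }
      alternation.append(Pattern.quote(name));
    }
    Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
    String replacement = Matcher.quoteReplacement("<START:" + entityType + "> ")
        + "$0" + Matcher.quoteReplacement(" <END>");
    return pattern.matcher(sentence).replaceAll(replacement);
  }

  public static void main(String[] args) {
    Set<String> blacklist = new HashSet<>(Arrays.asList("Osaka"));
    List<String> known = Arrays.asList("Tanaka", "Haruto Tanaka", "Priya Sharma", "Osaka");
    System.out.println(annotate(
        "Haruto Tanaka met Priya Sharma in Osaka .", known, blacklist, "person"));
    // prints: <START:person> Haruto Tanaka <END> met <START:person> Priya Sharma <END> in Osaka .
  }
}
```

In the real workflow, a model trained on such annotated sentences would propose further names, and any bad proposals would go into the blacklist before the next run.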
