How to create a good NER training model in OpenNLP?


Question

I have just started with OpenNLP. I need to create a simple training model to recognize named entities.

Reading the docs here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text for training the model:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
<START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC ,
    was named a director of this British industrial conglomerate .

I have two questions:

  • Why do I have to put the persons' names in a text (phrase) context? Why not write one person's name per line? Like:

<START:person> Robert <END>

<START:person> Maria <END>

<START:person> John <END>

  • How can I also add extra information to a name? For example, I would like to store Male/Female for each name.

    (I know there are systems that try to infer it from the last letter, like the "a" for Female, but I would like to add it myself.)

    Thanks.

    Recommended answer

    The answer to your first question is that the algorithm works on the surrounding context (tokens) within a sentence; it is not just a simple lookup mechanism. OpenNLP uses maximum entropy, which is a form of multinomial logistic regression, to build its model. The point of this is to reduce "word sense ambiguity" and find entities in context. For instance, if my name is April, I can easily be confused with the month of April, and if my name is May, I could be confused with the month of May as well as the verb "may".

    For the second part of your first question, you could make a list of known names and use it in a program that scans your sentences and automatically annotates them to help you create a training set. However, a list of names alone, without context, will not train the model sufficiently, or at all. In fact, there is an OpenNLP addon called the "modelbuilder addon" designed for this: you give it a file of names, and it uses the names and some of your data (sentences) to train a model. If you are looking for particular names of generally unambiguous entities, you may be better off using a list and something like a regex to find the names rather than NER.
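The "annotate your sentences from a list of known names" idea can be sketched in plain Java. This is only an illustration under stated assumptions: the class name TrainingSetAnnotator, the annotate method and the maxNameTokens parameter are hypothetical and are not part of OpenNLP or the modelbuilder addon. It assumes sentences are already whitespace-tokenized, as in the training sample above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not an OpenNLP class): wraps known names in <START:type> ... <END> markers. */
final class TrainingSetAnnotator {

    /**
     * Annotates one whitespace-tokenized sentence. At each position the longest
     * known name (up to maxNameTokens tokens) is preferred, so "Pierre Vinken"
     * wins over "Pierre" or "Vinken" alone.
     */
    static String annotate(String sentence, Set<String> knownNames, String type, int maxNameTokens) {
        String[] tokens = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            // Try the longest candidate first, then shrink.
            for (int len = Math.min(maxNameTokens, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                if (knownNames.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                out.add("<START:" + type + ">");
                out.addAll(Arrays.asList(tokens).subList(i, i + matched));
                out.add("<END>");
                i += matched;
            } else {
                out.add(tokens[i]);
                i++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList("Pierre Vinken", "Vinken", "Rudolph Agnew"));
        System.out.println(annotate("Pierre Vinken , 61 years old , will join the board .", names, "person", 3));
        System.out.println(annotate("Mr . Vinken is chairman of Elsevier N.V. .", names, "person", 3));
    }
}
```

Because this is a blind lookup, it would also tag "April" in "in April 2017" if April were in the name list, so the generated lines still need a manual review before they are used as training data.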

    As for your second question, there are a few options. In general, I don't think NER is a great tool for determining something like gender, although with enough training sentences you may get decent results. Since NER uses a model based on the surrounding tokens in your training sentences to establish the existence of a named entity, it can't do much to identify gender. You may be better off finding all person names and then looking them up in an index of names that you know are male or female. Also, some names, like Pat, are both male and female, and in most textual data there will be no indication of which it is, to either human or machine. That said, you could create separate male and female models, or you could create different entity types within the same model. You could use an annotation like this (using the distinct entity type names male.person and female.person). I've never tried this, but it might do OK; you'd have to test it on your data.

    <START:male.person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mrs . <START:female.person> Maria <END> is chairman of Elsevier N.V. , the Dutch publishing group
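
The alternative mentioned above, finding all person names first and then consulting an index of names with known gender, might look roughly like the following. It is a minimal sketch, not an OpenNLP feature: NameGenderIndex and its methods are hypothetical names, and the person spans are assumed to arrive as plain strings (for example, text extracted from whatever spans a name finder such as NameFinderME returned).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical post-processing step: look up gender for person names already found by NER. */
final class NameGenderIndex {

    enum Gender { MALE, FEMALE, AMBIGUOUS, UNKNOWN }

    private final Map<String, Gender> byFirstName = new HashMap<>();

    /** Registers a first name; registering it again with a different gender marks it AMBIGUOUS. */
    void add(String firstName, Gender gender) {
        byFirstName.merge(normalize(firstName), gender,
                (old, fresh) -> old == fresh ? old : Gender.AMBIGUOUS);
    }

    /** Looks up the first token of a detected person span, e.g. "Pierre Vinken" is looked up as "pierre". */
    Gender lookup(String personSpan) {
        String trimmed = personSpan.trim();
        if (trimmed.isEmpty()) {
            return Gender.UNKNOWN;
        }
        String first = trimmed.split("\\s+")[0];
        return byFirstName.getOrDefault(normalize(first), Gender.UNKNOWN);
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        NameGenderIndex index = new NameGenderIndex();
        index.add("Pierre", Gender.MALE);
        index.add("Maria", Gender.FEMALE);
        index.add("Pat", Gender.MALE);
        index.add("Pat", Gender.FEMALE); // listed under both genders, so it becomes AMBIGUOUS
        System.out.println(index.lookup("Pierre Vinken"));
        System.out.println(index.lookup("Pat"));
        System.out.println(index.lookup("Rudolph Agnew"));
    }
}
```

Keeping gender in a lookup table rather than in the entity type means a name like Pat can be reported as ambiguous instead of forcing the model to guess.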
    

    NER = Named Entity Recognition

    HTH
