Multi-term named entities in Stanford Named Entity Recognizer

Question

I'm using the Stanford Named Entity Recognizer http://nlp.stanford.edu/software/CRF-NER.shtml and it's working fine. This is my code:

    List<List<CoreLabel>> out = classifier.classify(text);
    for (List<CoreLabel> sentence : out) {
        for (CoreLabel word : sentence) {
            if (!StringUtils.equals(word.get(AnswerAnnotation.class), "O")) {
                namedEntities.add(word.word().trim());           
            }
        }
    }

However, the problem I'm running into is with first names and surnames. If the recognizer encounters "Joe Smith", it returns "Joe" and "Smith" separately. I'd really like it to return "Joe Smith" as one term.

Could this be achieved through the recognizer itself, perhaps through a configuration option? I haven't found anything in the Javadoc so far.

Thanks!

Recommended answer

This is because your inner for loop is iterating over individual tokens (words) and adding them separately. You need to change things to add whole names at once.

One way is to replace the inner for-each loop with an index-based for loop containing a while loop that consumes adjacent non-O tokens of the same class and adds them as a single entity.*
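As a rough illustration of that coalescing loop, here is a minimal sketch that deliberately avoids any dependency on the Stanford library. `EntityCoalescer` and `coalesce` are hypothetical names, and the two parallel lists stand in for the values the question's loop already reads from each `CoreLabel` (`word.word()` and `word.get(AnswerAnnotation.class)`). Joining tokens with a single space is also an assumption; it does not necessarily reproduce the original spacing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford NER API.
class EntityCoalescer {

    // tokens.get(i) carries the label labels.get(i); "O" means "not an entity".
    static List<String> coalesce(List<String> tokens, List<String> labels) {
        List<String> entities = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String label = labels.get(i);
            if ("O".equals(label)) {
                i++; // skip tokens outside any entity
                continue;
            }
            // Start a new entity and absorb every following token with the same label.
            StringBuilder entity = new StringBuilder(tokens.get(i));
            i++;
            while (i < tokens.size() && label.equals(labels.get(i))) {
                entity.append(' ').append(tokens.get(i));
                i++;
            }
            entities.add(entity.toString());
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Joe", "Smith", "lives", "in", "New", "York", ".");
        List<String> labels = List.of("PERSON", "PERSON", "O", "O", "LOCATION", "LOCATION", "O");
        System.out.println(coalesce(tokens, labels)); // prints [Joe Smith, New York]
    }
}
```

With a plain IO label scheme, two distinct entities of the same class that sit directly next to each other would be merged into one by this approach.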

Another way would be to use the CRFClassifier method call:

    List<Triple<String,Integer,Integer>> classifyToCharacterOffsets(String sentences)

which will give you whole entities; you can extract each one's String form by calling substring on the original input with the returned offsets.
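The substring step might look roughly like the sketch below. To keep it runnable without the Stanford jar, the hypothetical `Span` class stands in for `Triple<String,Integer,Integer>`; with the real library you would read the label and the two offsets from each triple returned by `classifyToCharacterOffsets`. The sketch assumes the end offset points just past the entity, which is what `String.substring` expects, and the offsets in `main` are hand-written for illustration rather than produced by a classifier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Stanford NER API.
class OffsetExtractor {

    // Stand-in for Triple<String,Integer,Integer>: label, begin offset, end offset.
    static final class Span {
        final String label;
        final int begin; // index of the first character of the entity
        final int end;   // index just past the last character (assumed exclusive)

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    // Cut each entity's surface form out of the original input text.
    static List<String> extract(String text, List<Span> spans) {
        List<String> entities = new ArrayList<>();
        for (Span span : spans) {
            entities.add(text.substring(span.begin, span.end));
        }
        return entities;
    }

    public static void main(String[] args) {
        String text = "Joe Smith lives in New York.";
        // Illustrative offsets only; a real run would take them from the classifier.
        List<Span> spans = List.of(new Span("PERSON", 0, 9), new Span("LOCATION", 19, 27));
        System.out.println(extract(text, spans)); // prints [Joe Smith, New York]
    }
}
```

Because the text is sliced from the original input, this variant keeps whatever spacing and punctuation appeared inside the entity.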

*The models that we distribute use a simple raw IO label scheme, where things are labeled PERSON or LOCATION, and the appropriate thing to do is simply to coalesce adjacent tokens with the same label. Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012).
