Stanford NLP named entities of more than one token


Problem description

I'm experimenting with Stanford CoreNLP for named entity recognition.

Some named entities consist of more than one token, for example, Person: "Bill Smith". I can't figure out which API calls to use to determine when "Bill" and "Smith" should be treated as a single entity, and when they should be treated as two different entities.

Is there some decent documentation somewhere that explains this?

Here is my current code:

    InputStream is = getClass().getResourceAsStream(MODEL_NAME);
    if (MODEL_NAME.endsWith(".gz")) {
        is = new GZIPInputStream(is);
    }
    is = new BufferedInputStream(is);

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(is);
    is.close();

    String text = "Hello, Bill Smith, how are you?";

    List<List<CoreLabel>> sentences = classifier.classify(text);
    for (List<CoreLabel> sentence: sentences) {
        for (CoreLabel word: sentence) {
            String type = word.get(CoreAnnotations.AnswerAnnotation.class);
            System.out.println(word + " is of type " + type);
        }
    }

Also, it isn't clear to me why the "PERSON" annotation comes back as AnswerAnnotation rather than CoreAnnotations.EntityClassAnnotation, EntityTypeAnnotation, or something else.

Recommended answer

You should use the "entitymentions" annotator, which marks each contiguous sequence of tokens sharing the same NER tag as a single entity. The list of entities for each sentence is stored under the CoreAnnotations.MentionsAnnotation.class key. Each entity mention is itself a CoreMap.

Looking over this code could help:

https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/EntityMentionsAnnotator.java

Some sample code:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;



public class EntityMentionsExample {

  public static void main (String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "Joe Smith is from Florida.";
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    System.out.println("---");
    System.out.println("text: " + text);
    // Each sentence is a CoreMap; its entity mentions are stored under MentionsAnnotation.
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        // Print the full text of the mention (which may span several tokens), then its NER tag.
        System.out.print(entityMention.get(CoreAnnotations.TextAnnotation.class));
        System.out.print("\t");
        System.out.print(
                entityMention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
        System.out.println();
      }
    }
  }
}
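
The grouping rule described in the answer (adjacent tokens with the same NER tag form one entity) can also be illustrated without the pipeline. The following is only a minimal sketch in plain Java, assuming you already have per-token tags such as those printed by the classifier loop in the question. The class name MentionGrouper and its group method are hypothetical and are not part of CoreNLP; the real EntityMentionsAnnotator handles more cases than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: groups adjacent tokens
// that carry the same NER tag into a single mention.
public class MentionGrouper {

  /** Label conventionally used for tokens that are not part of any entity. */
  private static final String OUTSIDE = "O";

  /**
   * Merges each run of adjacent tokens sharing the same non-"O" tag
   * into one mention, returned as "text<TAB>tag".
   */
  public static List<String> group(List<String> tokens, List<String> tags) {
    if (tokens.size() != tags.size()) {
      throw new IllegalArgumentException("tokens and tags must have the same length");
    }
    List<String> mentions = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    String currentTag = OUTSIDE;
    for (int i = 0; i < tokens.size(); i++) {
      String tag = tags.get(i);
      if (!tag.equals(currentTag)) {
        // The tag changed, so close the mention in progress (if any).
        if (!OUTSIDE.equals(currentTag)) {
          mentions.add(current + "\t" + currentTag);
        }
        current.setLength(0);
        currentTag = tag;
      }
      if (!OUTSIDE.equals(tag)) {
        // Extend the current mention, separating tokens with a space.
        if (current.length() > 0) {
          current.append(' ');
        }
        current.append(tokens.get(i));
      }
    }
    // Close a mention that runs to the end of the sentence.
    if (!OUTSIDE.equals(currentTag)) {
      mentions.add(current + "\t" + currentTag);
    }
    return mentions;
  }

  public static void main(String[] args) {
    List<String> tokens =
        Arrays.asList("Hello", ",", "Bill", "Smith", ",", "how", "are", "you", "?");
    List<String> tags =
        Arrays.asList("O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O");
    for (String mention : group(tokens, tags)) {
      System.out.println(mention);
    }
  }
}
```

Note that a rule based only on tag equality cannot separate two different entities of the same type that happen to be adjacent, such as two person names with nothing between them; distinguishing those would need a labeling scheme that marks where each entity begins.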
