Coreference resolution using Stanford CoreNLP

Question

I am new to the Stanford CoreNLP toolkit and am trying to use it for a project to resolve coreferences in news texts. To use the Stanford CoreNLP coreference system, we would usually create a pipeline, which requires tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition and parsing. For example:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = "As competition heats up in Spain's crowded bank market, Banco Exterior de Espana is seeking to shed its image of a state-owned bank and move into new activities.";

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

Then we can easily get the sentence annotations with:

List<CoreMap> sentences = document.get(SentencesAnnotation.class);

However, I am using other tools for preprocessing and just need a stand-alone coreference resolution system. It is pretty easy to create token and parse tree annotations and set them on the annotation:

// create new annotation
Annotation annotation = new Annotation();

// create token annotations for each sentence from the input file
List<CoreLabel> tokens = new ArrayList<>();
for(int tokenCount = 0; tokenCount < parsedSentence.size(); tokenCount++) {

        ArrayList<String> parsedLine = parsedSentence.get(tokenCount);
        String word = parsedLine.get(1);
        String lemma = parsedLine.get(2);
        String posTag = parsedLine.get(3);
        String namedEntity = parsedLine.get(4); 
        String partOfParseTree = parsedLine.get(6);

        CoreLabel token = new CoreLabel();
        token.setWord(word);
        token.setLemma(lemma);
        token.setTag(posTag);
        token.setNER(namedEntity);
        tokens.add(token);
    }

// set tokens annotations to annotation
annotation.set(TokensAnnotation.class, tokens);

// set parse tree annotations to annotation
Tree stanfordParseTree = Tree.valueOf(inputParseTree);
annotation.set(TreeAnnotation.class, stanfordParseTree);

However, creating sentence annotations is pretty tricky, because to my knowledge there is no documentation that explains it in full detail. I am able to create the data structure for the sentence annotations and set it on the annotation:

List<CoreMap> sentences = new ArrayList<CoreMap>();
annotation.set(SentencesAnnotation.class, sentences);

I am sure it cannot be that difficult, but there is no documentation on how to create sentence annotations from token annotations, i.e. how to fill the ArrayList with actual sentence annotations.

Any ideas?

By the way, if I use the token and parse tree annotations provided by my processing tools, take only the sentence annotations from the StanfordCoreNLP pipeline, and then apply the stand-alone StanfordCoreNLP coreference resolution system, I get the correct results. So the only part missing for a complete stand-alone coreference resolution system is the ability to create the sentence annotations from the token annotations.

Recommended answer

There is an Annotation constructor with a List<CoreMap> sentences argument, which sets up the document if you have a list of already tokenized sentences.

For each sentence, you want to create a CoreMap object as follows. (Note that I also added a sentence index and a token index to each sentence and token object, respectively.)

int sentenceIdx = 1;
List<CoreMap> sentences = new ArrayList<CoreMap>();
for (List<ArrayList<String>> parsedSentence : parsedSentences) {
    CoreMap sentence = new CoreLabel();
    List<CoreLabel> tokens = new ArrayList<>();
    for(int tokenCount = 0; tokenCount < parsedSentence.size(); tokenCount++) {

        ArrayList<String> parsedLine = parsedSentence.get(tokenCount);
        String word = parsedLine.get(1);
        String lemma = parsedLine.get(2);
        String posTag = parsedLine.get(3);
        String namedEntity = parsedLine.get(4); 
        String partOfParseTree = parsedLine.get(6);

        CoreLabel token = new CoreLabel();
        token.setWord(word);
        token.setLemma(lemma);
        token.setTag(posTag);
        token.setNER(namedEntity);
        token.setIndex(tokenCount + 1);
        tokens.add(token);
    }

    // set tokens annotations and id of sentence 
    sentence.set(TokensAnnotation.class, tokens);
    sentence.set(SentenceIndexAnnotation.class, sentenceIdx++);

    // set parse tree annotation on the sentence
    Tree stanfordParseTree = Tree.valueOf(inputParseTree);
    sentence.set(TreeAnnotation.class, stanfordParseTree);

    // add sentence to list of sentences
    sentences.add(sentence);
}

Then you can create an Annotation instance with the sentences list:

Annotation annotation = new Annotation(sentences);
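
The loop over parsedSentences assumes that the preprocessed input has already been grouped into sentences, which the snippets here do not show. The following is only a minimal sketch of one way to build that structure, under the assumption that the external tool writes one token per line with tab-separated columns and a blank line between sentences. The class name ConllSentenceSplitter and the method split are hypothetical names chosen for illustration and are not part of Stanford CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of CoreNLP): groups token rows into sentences. */
class ConllSentenceSplitter {

    /**
     * Splits token rows into sentences.
     * Assumption: columns are tab-separated and a blank line ends a sentence.
     */
    static List<List<ArrayList<String>>> split(List<String> lines) {
        List<List<ArrayList<String>>> sentences = new ArrayList<>();
        List<ArrayList<String>> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // blank line: close the current sentence if it has any tokens
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                // one token per row; keep every column as a string
                current.add(new ArrayList<String>(Arrays.asList(line.split("\t"))));
            }
        }
        // the last sentence may not be followed by a blank line
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // made-up rows: index, word, lemma, POS tag, NER tag
        List<String> lines = Arrays.asList(
            "1\tBanco\tBanco\tNNP\tORGANIZATION",
            "2\tseeks\tseek\tVBZ\tO",
            "",
            "1\tIt\tit\tPRP\tO",
            "2\tgrows\tgrow\tVBZ\tO");
        List<List<ArrayList<String>>> parsedSentences = split(lines);
        System.out.println(parsedSentences.size() + " sentences");
    }
}
```

Each inner list then plays the role of parsedSentence in the loop, and each row plays the role of parsedLine, so parsedLine.get(1) would be the word column in this assumed layout. The actual column positions depend on the preprocessing tool and need to be adjusted accordingly.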
