Formatting NER output from Stanford CoreNLP

Problem description

I am working with Stanford CoreNLP and using it for NER. When I extract organization names, each word is tagged separately. So if the entity is "NEW YORK TIMES", it is recorded as three different entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that the combined string is returned as a single entity?

In Stanford NER, the command line utility lets us choose the output format, for example inlineXML. Can we set a property to select the output format in Stanford CoreNLP in the same way?

Recommended answer

If you just want the complete string of each named entity found by Stanford NER, try this:

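import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);  // (class, start offset, end offset)
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third));  // full entity string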

In case you're wondering, the entity class (for example ORGANIZATION) is given by entity.first.

Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
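
If you go the inline XML route and still need the entities as separate strings, one option is to pull the tagged spans back out with a regular expression. The following is only a minimal sketch in plain Java with no CoreNLP dependency: the class name InlineXmlEntities and the method extract are hypothetical, and it assumes tags are upper-case names that are not nested and that the tagged text looks like the sample above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanford NER or CoreNLP.
public class InlineXmlEntities {
    // Matches <TAG>text</TAG> where the closing tag repeats the opening tag name.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns each tagged span as a two-element array: {label, text}. */
    public static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2) });
        }
        return entities;
    }

    public static void main(String[] args) {
        // Sample string copied from the answer above; in practice this would be
        // the return value of ner.classifyWithInlineXML(text).
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String[] e : extract(tagged)) {
            System.out.println(e[0] + "\t" + e[1]);
        }
    }
}
```

For the sample string this yields two entries, PERSON with "Bill Smith" and LOCATION with "Paris", so a multi-word name comes back as one string. If character offsets are also needed, classifyToCharacterOffsets as shown earlier is probably the simpler choice.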
