Formatting NER output from Stanford CoreNLP
Question
I am working with Stanford CoreNLP and using it for NER. When I extract organization names, each word is tagged separately. So if the entity is "NEW YORK TIMES", it is recorded as three different entities: "NEW", "YORK" and "TIMES". Is there a property we can set in Stanford CoreNLP so that the combined string is returned as a single entity?
In Stanford NER, the command line utility lets us choose the output format, for example inlineXML. Can we set a property to select the output format in Stanford CoreNLP in a similar way?
Recommended answer
If you just want the complete string of each named entity found by Stanford NER, try this:
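import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third)); // second/third = start/end character offsets in text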
In case you're wondering, the entity class is indicated by entity.first.
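For intuition about what "combining" per-token tags into one entity involves, here is a minimal sketch that does not use CoreNLP at all. It assumes you already have parallel lists of tokens and tags, with "O" marking non-entity tokens. The class name EntityMerger and the method merge are hypothetical names chosen for this illustration, not part of any Stanford API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EntityMerger {
    /**
     * Joins consecutive tokens that share the same non-"O" tag into a single
     * "TAG: text" string. Hypothetical helper, not a CoreNLP method.
     */
    static List<String> merge(List<String> tokens, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed, so the previous run (if it was an entity) is complete.
                if (!currentTag.equals("O")) {
                    entities.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(tokens.get(i));
            }
        }
        // Flush a run that extends to the end of the input.
        if (!currentTag.equals("O")) {
            entities.add(currentTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "New", "York", "Times", "is", "based", "in", "New", "York");
        List<String> tags = Arrays.asList("O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION",
                                          "O", "O", "O", "LOCATION", "LOCATION");
        System.out.println(merge(tokens, tags));
    }
}
```

One limitation of this simple approach: two distinct entities of the same class that sit directly next to each other are merged into one, because plain per-token class tags carry no boundary information.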
Alternatively, you can use ner.classifyWithInlineXML(text) to get output that looks like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
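If you go the inline XML route and then want the entities back as data, a small regular expression is one way to pull them out. The sketch below works on a string shaped like the sample output above. It assumes tag names consist of uppercase letters and underscores and that the entity text contains no angle brackets. InlineXmlEntities and extract are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineXmlEntities {
    // Matches <TAG>text</TAG>; the backreference \1 forces the closing tag
    // to carry the same name as the opening tag.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns one {class, text} pair per tagged span, in order of appearance. */
    static List<String[]> extract(String tagged) {
        List<String[]> result = new ArrayList<>();
        Matcher m = ENTITY.matcher(tagged);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String[] entity : extract(tagged)) {
            System.out.println(entity[0] + ": " + entity[1]);
        }
    }
}
```

If you need character offsets or want to avoid parsing markup altogether, classifyToCharacterOffsets from the first snippet is likely the simpler choice.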