How to keep punctuation in the Stanford dependency parser
Question
I am using Stanford CoreNLP (01.2016 version) and I would like to keep the punctuation in the dependency relations. I have found some ways of doing that when running it from the command line, but I couldn't find anything about the Java code that extracts the dependency relations.
Here is my current code. It works, but no punctuation is included:
Annotation document = new Annotation(text);
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
props.setProperty("ssplit.newlineIsSentenceBreak", "always");
props.setProperty("ssplit.eolonly", "true");
props.setProperty("pos.model", modelPath1);
props.put("parse.model", modelPath );
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
LexicalizedParser lp = LexicalizedParser.loadModel(modelPath + lexparserNameEn,
        "-maxLength", "200", "-retainTmpSubcategories");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    List<CoreLabel> words = sentence.get(CoreAnnotations.TokensAnnotation.class);
    Tree parse = lp.apply(words);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection<TypedDependency> td = gs.typedDependencies();
    parsedText += td.toString() + "\n";
}
Any kind of dependency relation is fine for me: basic, typed, collapsed, etc. I just want the punctuation marks to be included.
Thanks in advance.
Recommended answer
You are doing quite a bit of extra work here: the parser runs once as part of the CoreNLP pipeline and then a second time when you call lp.apply(words).
The easiest way to get a dependency tree/graph that includes punctuation marks is to set the CoreNLP option parse.keepPunct, as follows.
Annotation document = new Annotation(text);
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
props.setProperty("ssplit.newlineIsSentenceBreak", "always");
props.setProperty("ssplit.eolonly", "true");
props.setProperty("pos.model", modelPath1);
props.setProperty("parse.model", modelPath);
props.setProperty("parse.keepPunct", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
// Retrieve the sentences annotated by the pipeline
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    // Pick whichever representation you want
    SemanticGraph basicDeps = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
    SemanticGraph collapsed = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
    SemanticGraph ccProcessed = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
}
The sentence annotation object stores the dependency trees/graphs as SemanticGraph objects. If you want a collection of TypedDependency objects instead, use the method typedDependencies(). For example,
Collection<TypedDependency> dependencies = basicDeps.typedDependencies();
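As a side note on the configuration used above, the property setup can be pulled into a small helper so the punctuation flag lives in one place. This is only a sketch using java.util.Properties: the class name KeepPunctProps, the method build, and the model paths in main are hypothetical placeholders, not part of CoreNLP. The property keys and values are the ones from the answer.

```java
import java.util.Properties;

public class KeepPunctProps {

    // Hypothetical helper: collects the pipeline settings from the answer.
    // The two model paths are supplied by the caller.
    static Properties build(String posModelPath, String parseModelPath) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
        props.setProperty("ssplit.newlineIsSentenceBreak", "always");
        props.setProperty("ssplit.eolonly", "true");
        props.setProperty("pos.model", posModelPath);
        props.setProperty("parse.model", parseModelPath);
        // Property values are strings, so the flag is the string "true"
        props.setProperty("parse.keepPunct", "true");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder paths, assumed for illustration only
        Properties props = build("path/to/pos-model", "path/to/parse-model");
        System.out.println(props.getProperty("parse.keepPunct"));
    }
}
```

The resulting Properties object would then be passed to the StanfordCoreNLP constructor exactly as in the answer. Using setProperty rather than put keeps every value a String, which is what getProperty expects when the settings are read back.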