Stanford CoreNLP: Use partial existing annotation


Problem Description

We are trying to use existing

  • tokenization
  • sentence splitting
  • and named entity tagging

while we would like to use Stanford CoreNlp to additionally provide us with

  • part-of-speech tags
  • lemmatization
  • and parsing

Currently, we are trying it the following way:

1) make an annotator for "pos, lemma, parse"

Properties pipelineProps = new Properties();
pipelineProps.setProperty("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);

2) read in the sentences, with a custom method:

List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));

within that method, the tokens are constructed the following way:

CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
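// carry over the pre-existing named-entity tag from the input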
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);

and they are combined into sentences like this:

Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
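// record this sentence's position in the document-wide token sequence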
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());

3) the list of sentences is passed to the pipeline:

  Annotation document = new Annotation(sentences);
  pipeline.annotate(document);


However, when running this, we get the following error:

null: InvocationTargetException: annotator "pos" requires annotator "tokenize"


Any pointers on how we can achieve what we want to do?

Recommended Answer

The exception is thrown because of an unsatisfied requirement of the "pos" annotator (an instance of the POSTaggerAnnotator class).

The requirements for annotators that StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator, two requirements are defined:

  • tokenize
  • ssplit

Both of these requirements need to be satisfied, which means that both the "tokenize" annotator and the "ssplit" annotator must be specified in the annotators list before the "pos" annotator.
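
For reference, a minimal configuration that satisfies both requirements simply lists them ahead of "pos". This is not what the question asks for, since tokenization and sentence splitting already exist, but it shows what the check expects:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);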

Now back to the question... If you want to skip the "tokenize" and "ssplit" annotations in your pipeline, you need to disable the requirements check that is performed during initialization of the pipeline. I found two equivalent ways to do this:

  • Disable requirements enforcement in the properties object passed to the StanfordCoreNLP constructor:

props.setProperty("enforceRequirements", "false");

  • Set the enforceRequirements parameter of the StanfordCoreNLP constructor to false:

StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
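
Putting it all together, here is a minimal, self-contained sketch that combines the question's pre-built sentences with the second workaround. The sample sentence and the class name PartialAnnotationSketch are made up for illustration; the CoreNLP calls are the ones shown above, and the sketch itself is untested:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class PartialAnnotationSketch {
  public static void main(String[] args) {
    // Pipeline without "tokenize" and "ssplit"; the second constructor
    // argument (enforceRequirements = false) disables the check that
    // would otherwise reject this annotator list.
    Properties props = new Properties();
    props.setProperty("annotators", "pos, lemma, parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);

    // Pre-tokenized, pre-split input (stands in for the asker's
    // getSentencesForTaggedFile); existing NE tags could be attached
    // here too via CoreAnnotations.NamedEntityTagAnnotation.class.
    List<CoreLabel> tokens = new ArrayList<>();
    for (String w : new String[]{"Stanford", "is", "in", "California", "."}) {
      CoreLabel token = new CoreLabel();
      token.setValue(w);
      token.setWord(w);
      token.setOriginalText(w);
      tokens.add(token);
    }

    Annotation sentence = new Annotation("Stanford is in California .");
    sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
    sentence.set(CoreAnnotations.TokenBeginAnnotation.class, 0);
    sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokens.size());
    sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, 0);

    List<CoreMap> sentences = new ArrayList<>();
    sentences.add(sentence);

    // The Annotation(List<CoreMap>) constructor fills in the
    // document-level text, token, and sentence annotations.
    Annotation document = new Annotation(sentences);
    pipeline.annotate(document);

    // The pipeline has now added POS tags, lemmas, and parse trees.
    for (CoreMap s : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel t : s.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(t.word() + " -> "
            + t.get(CoreAnnotations.PartOfSpeechAnnotation.class));
      }
    }
  }
}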
