Force Stanford CoreNLP Parser to Prioritize 'S' Label at Root Level
Question
Greetings, NLP experts,
I am using the Stanford CoreNLP software package to produce constituency parses, using the most recent version (3.9.2) of the English language models JAR, downloaded from the CoreNLP Download page. I access the parser via the Python interface from the NLTK module nltk.parse.corenlp. Here is a snippet from the top of my main module:
import nltk
from nltk.tree import ParentedTree
from nltk.parse.corenlp import CoreNLPParser
parser = CoreNLPParser(url='http://localhost:9000')
I also fire up the server using the following (fairly generic) call from the terminal:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -annotators "parse" -port 9000 -timeout 30000
The parser that CoreNLP selects by default (when the full English model is available) is the Shift-Reduce (SR) parser, which is sometimes claimed to be both more accurate and faster than the CoreNLP PCFG parser. Impressionistically, I can corroborate that with my own experience, where I deal almost exclusively with Wikipedia text.
However, I have noticed that often the parser will erroneously opt for parsing what is in fact a complete sentence (i.e., a finite matrix clause) as a subsentential constituent instead, often an NP. In other words, the parser should be outputting an S label at root level (ROOT (S ...)), but something in the complexity of the sentence's syntax pushes the parser to say a sentence is not a sentence (ROOT (NP ...)), etc.
The parses for such problem sentences also always contain another (usually glaring) error further down in the tree. Below are a few examples; I'll paste in only the top few levels of each tree to save space. Each is a perfectly acceptable English sentence, so the parses should all begin (ROOT (S ...)). In each case, however, some other label takes the place of S, and the rest of the tree is garbled.
NP: An estimated 22–189 million school days are missed annually due to a cold.
(ROOT (NP (NP An estimated 22) (: --) (S 189 million school days are missed annually due to a cold) (. .)))
FRAG: More than one-third of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance. (ROOT (FRAG (NP (NP More than one-third) (PP of people who saw a doctor received an antibiotic prescription, which has implications for antibiotic resistance)) (. .)))
UCP: Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. (ROOT (UCP (S Coffee is a brewed drink prepared from roasted coffee beans) (, ,) (NP the seeds of berries from certain Coffea species) (. .)))
At long last, here is my question, which I trust the above evidence shows to be a useful one: given that my data contains a negligible number of fragments or otherwise ill-formed sentences, how can I impose a high-level constraint on the CoreNLP parser such that its algorithm gives priority to assigning an S node directly below ROOT?
I am curious to see whether imposing such a constraint when processing data (that one knows to satisfy it) will also cure the myriad other ills observed in the parses produced. From what I understand, the solution would not lie in specifying a ParserAnnotations.ConstraintAnnotation. Would it?
Answer
You can specify that a certain span of tokens has to be labeled a certain way. So you can say the entire span has to be an 'S'. But I think you have to do this in Java code.
Here is example code that shows setting constraints.
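A minimal sketch of what that could look like, using the `ParserConstraint` and `ParserAnnotations.ConstraintAnnotation` classes from `edu.stanford.nlp.parser.common`. The two-pass pipeline setup, the class name `ForceSRoot`, and the exclusive end index for the token span are my assumptions; check the constructor and span conventions against your CoreNLP version, and note that whether constraints are honored can depend on which parser (SR vs. PCFG) is actually loaded:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.parser.common.ParserAnnotations;
import edu.stanford.nlp.parser.common.ParserConstraint;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Collections;
import java.util.Properties;

public class ForceSRoot {  // hypothetical class name
    public static void main(String[] args) {
        // First pass: tokenize, split, and tag only, so we know each
        // sentence's token span before parsing.
        Properties preProps = new Properties();
        preProps.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP prePipeline = new StanfordCoreNLP(preProps);

        Annotation doc = new Annotation(
            "An estimated 22-189 million school days are missed annually due to a cold.");
        prePipeline.annotate(doc);

        // Constrain the full token span of every sentence to be labeled S.
        // Span is assumed to be [0, numTokens), end-exclusive.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            int numTokens = sentence.get(CoreAnnotations.TokensAnnotation.class).size();
            ParserConstraint constraint = new ParserConstraint(0, numTokens, "S");
            sentence.set(ParserAnnotations.ConstraintAnnotation.class,
                         Collections.singletonList(constraint));
        }

        // Second pass: run the parser with the constraints in place.
        Properties parseProps = new Properties();
        parseProps.setProperty("annotators", "parse");
        parseProps.setProperty("enforceRequirements", "false");
        StanfordCoreNLP parsePipeline = new StanfordCoreNLP(parseProps);
        parsePipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            System.out.println(sentence.get(TreeCoreAnnotations.TreeAnnotation.class));
        }
    }
}
```

Since the NLTK `CoreNLPParser` wrapper only talks to the server over HTTP, there is no obvious way to pass such per-sentence constraints through it; the constrained parse would have to run on the Java side as above.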