Stanford typed dependencies using coreNLP in python


Question


In the Stanford Dependency Manual they mention "Stanford typed dependencies" and in particular the type "neg" - negation modifier. It is also available when using the Stanford Enhanced++ parser on the website. For example, the sentence:

"Barack Obama was not born in Hawaii"

The parser indeed finds neg(born, not)

but when I'm using the stanfordnlp python library, the only dependency parser I can get parses the sentence as follows:

('Barack', '5', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '5', 'aux:pass')
('not', '5', 'advmod')
('born', '0', 'root')
('in', '7', 'case')
('Hawaii', '5', 'obl')

and the code that generates it:

import stanfordnlp

stanfordnlp.download('en')  # download the English models (only needed once)
nlp = stanfordnlp.Pipeline()  # build the default neural pipeline
doc = nlp("Barack Obama was not born in Hawaii")
a = doc.sentences[0]
a.print_dependencies()

Is there a way to get results similar to the enhanced dependency parser, or any other Stanford parser that produces typed dependencies, so that I get the negation modifier?

Solution

Note that the Python library stanfordnlp is not just a Python wrapper for Stanford CoreNLP.

1. The difference between StanfordNLP and CoreNLP

As stated in the stanfordnlp GitHub repo:

The Stanford NLP Group's official Python NLP library. It contains packages for running our latest fully neural pipeline from the CoNLL 2018 Shared Task and for accessing the Java Stanford CoreNLP server.

Stanfordnlp contains a new set of neural network models, trained on the CoNLL 2018 shared task. The online parser is based on the CoreNLP 3.9.2 Java library. These are two different pipelines and sets of models, as explained here.

Your code only accesses their neural pipeline trained on the CoNLL 2018 data. This explains the differences you saw compared to the online version. They are basically two different models.

What adds to the confusion, I believe, is that both repositories belong to the user named stanfordnlp (which is the team name). Don't confuse the Java stanfordnlp/CoreNLP with the Python stanfordnlp/stanfordnlp.

Concerning your 'neg' issue, it seems that in the Python library stanfordnlp, they decided to label negation with the 'advmod' annotation altogether. At least that is what I ran into for a few example sentences.
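If all you need is the negation modifier, one pragmatic workaround is to post-process the neural pipeline's output yourself. The sketch below is my own, not part of the library: it treats 'advmod' dependents whose text is a common negation word (an assumed, non-exhaustive list) as negation modifiers.

import stanfordnlp

NEGATION_WORDS = {"not", "n't", "never", "no"}  # assumed list, not exhaustive

nlp = stanfordnlp.Pipeline()
doc = nlp("Barack Obama was not born in Hawaii")

sentence = doc.sentences[0]
for word in sentence.words:
    # 'governor' is the 1-based index of the head word (0 means root)
    if word.dependency_relation == 'advmod' and word.text.lower() in NEGATION_WORDS:
        head = sentence.words[int(word.governor) - 1].text
        print('neg({}, {})'.format(head, word.text))  # -> neg(born, not)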

2. Using CoreNLP via the stanfordnlp package

However, you can still access CoreNLP through the stanfordnlp package. It requires a few more steps, though. Citing the GitHub repo:

There are a few initial setup steps.

  • Download Stanford CoreNLP and models for the language you wish to use. (You can download CoreNLP and the language models here.)
  • Put the model jars in the distribution folder
  • Tell the python code where Stanford CoreNLP is located: export CORENLP_HOME=/path/to/stanford-corenlp-full-2018-10-05

Once that is done, you can start a client, with code that can be found in the demo:

from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."

with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    print('---')
    print('dependency parse of first sentence')
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)

    # get the tokens of the first sentence
    # note that 1 token is 1 node in the parse tree, and nodes start at 1
    print('---')
    print('Tokens of first sentence')
    for token in sentence.token:
        print(token)

Your sentence will therefore be parsed if you specify the 'depparse' annotator (as well as the prerequisite annotators tokenize, ssplit, and pos). Reading the demo, it seems that we can only access basicDependencies. I have not managed to make Enhanced++ dependencies work via stanfordnlp.
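For completeness: the CoreNLP protobuf Sentence message also defines an enhancedPlusPlusDependencies field, so if you want to experiment you could try reading it. Whether it is actually populated may depend on the annotators requested; as said above, I did not get it working reliably.

# hedged sketch: this field exists on the protobuf Sentence message,
# but it may be empty depending on the annotators requested
enhanced_parse = sentence.enhancedPlusPlusDependencies
print(enhanced_parse)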

But the negations will still appear if you use basicDependencies!

Here is the output I obtained using stanfordnlp and your example sentence. It is a DependencyGraph object, not pretty, but that is unfortunately always the case when we use the very deep CoreNLP tools. You will see that between nodes 4 and 5 ('not' and 'born') there is an edge 'neg'.

node {
  sentenceIndex: 0
  index: 1
}
node {
  sentenceIndex: 0
  index: 2
}
node {
  sentenceIndex: 0
  index: 3
}
node {
  sentenceIndex: 0
  index: 4
}
node {
  sentenceIndex: 0
  index: 5
}
node {
  sentenceIndex: 0
  index: 6
}
node {
  sentenceIndex: 0
  index: 7
}
node {
  sentenceIndex: 0
  index: 8
}
edge {
  source: 2
  target: 1
  dep: "compound"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 2
  dep: "nsubjpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 3
  dep: "auxpass"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 4
  dep: "neg"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 7
  dep: "nmod"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 5
  target: 8
  dep: "punct"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
edge {
  source: 7
  target: 6
  dep: "case"
  isExtra: false
  sourceCopy: 0
  targetCopy: 0
  language: UniversalEnglish
}
root: 5

---
Tokens of first sentence
word: "Barack"
pos: "NNP"
value: "Barack"
before: ""
after: " "
originalText: "Barack"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false

word: "Obama"
pos: "NNP"
value: "Obama"
before: " "
after: " "
originalText: "Obama"
beginChar: 7
endChar: 12
tokenBeginIndex: 1
tokenEndIndex: 2
hasXmlContext: false
isNewline: false

word: "was"
pos: "VBD"
value: "was"
before: " "
after: " "
originalText: "was"
beginChar: 13
endChar: 16
tokenBeginIndex: 2
tokenEndIndex: 3
hasXmlContext: false
isNewline: false

word: "not"
pos: "RB"
value: "not"
before: " "
after: " "
originalText: "not"
beginChar: 17
endChar: 20
tokenBeginIndex: 3
tokenEndIndex: 4
hasXmlContext: false
isNewline: false

word: "born"
pos: "VBN"
value: "born"
before: " "
after: " "
originalText: "born"
beginChar: 21
endChar: 25
tokenBeginIndex: 4
tokenEndIndex: 5
hasXmlContext: false
isNewline: false

word: "in"
pos: "IN"
value: "in"
before: " "
after: " "
originalText: "in"
beginChar: 26
endChar: 28
tokenBeginIndex: 5
tokenEndIndex: 6
hasXmlContext: false
isNewline: false

word: "Hawaii"
pos: "NNP"
value: "Hawaii"
before: " "
after: ""
originalText: "Hawaii"
beginChar: 29
endChar: 35
tokenBeginIndex: 6
tokenEndIndex: 7
hasXmlContext: false
isNewline: false

word: "."
pos: "."
value: "."
before: ""
after: ""
originalText: "."
beginChar: 35
endChar: 36
tokenBeginIndex: 7
tokenEndIndex: 8
hasXmlContext: false
isNewline: false
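Rather than eyeballing that dump, you can also check for the negation edge programmatically. A small sketch, reusing dependency_parse from the client code above:

neg_edges = [(edge.source, edge.target) for edge in dependency_parse.edge if edge.dep == 'neg']
print(neg_edges)  # [(5, 4)], i.e. 'born' -> 'not'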

3. Using CoreNLP via the NLTK package

I will not go into details on this one, but there is also a solution to access the CoreNLP server via the NLTK library, if all else fails. It does output the negations, but requires a little more work to start the servers. Details on this page.
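For reference, a minimal sketch of that route, assuming a CoreNLP server is already running locally on port 9000:

from nltk.parse.corenlp import CoreNLPDependencyParser

# assumes a CoreNLP server is already running, e.g. started with:
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPDependencyParser(url='http://localhost:9000')
parse, = parser.raw_parse("Barack Obama was not born in Hawaii.")
# the triples should include the negation, e.g. (('born', 'VBN'), 'neg', ('not', 'RB'))
print(list(parse.triples()))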

EDIT

I figured I could also share the code to turn the DependencyGraph into a nice list of (dependency, argument1, argument2) tuples, in a shape similar to what stanfordnlp outputs.

from stanfordnlp.server import CoreNLPClient

text = "Barack Obama was not born in Hawaii."

# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','depparse'], timeout=60000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)

    # get the first sentence
    sentence = ann.sentence[0]

    # get the dependency parse of the first sentence
    dependency_parse = sentence.basicDependencies

    #print(dir(sentence.token[0])) #to find all the attributes and methods of a Token object
    #print(dir(dependency_parse)) #to find all the attributes and methods of a DependencyGraph object
    #print(dir(dependency_parse.edge))

    # build a dictionary associating each token/node index with its word
    token_dict = {token.tokenEndIndex: token.word for token in sentence.token}

    # get a list of the dependencies with the words they connect
    list_dep = []
    for edge in dependency_parse.edge:
        source_name = token_dict[edge.source]
        target_name = token_dict[edge.target]
        list_dep.append((edge.dep,
                         str(edge.source) + '-' + source_name,
                         str(edge.target) + '-' + target_name))
    print(list_dep)

It outputs the following:

[('compound', '2-Obama', '1-Barack'), ('nsubjpass', '5-born', '2-Obama'), ('auxpass', '5-born', '3-was'), ('neg', '5-born', '4-not'), ('nmod', '5-born', '7-Hawaii'), ('punct', '5-born', '8-.'), ('case', '7-Hawaii', '6-in')]
