Stanford Universal Dependencies on Python NLTK
Question
Is there any way I can get the Universal Dependencies using Python or NLTK? I can only produce the parse tree.
Example:
Input sentence:
My dog also likes eating sausage.
Output:
Universal dependencies
nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)
Wordseer's stanford-corenlp-python fork is a good start, as it works with the recent CoreNLP release (3.5.2). However, it will give you raw output, which you need to transform manually. For example, assuming you have the wrapper running:
>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.'))) # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
[u'nsubj',
u'loves',
u'2',
u'John',
u'1'],
[u'dobj', u'loves', u'2', u'Mary', u'3'],
[u'punct', u'loves', u'2', u'.', u'4']],
u'parsetree': [],
u'text': u'John loves Mary.',
u'words': [[u'John',
{u'CharacterOffsetBegin': u'0',
u'CharacterOffsetEnd': u'4',
u'Lemma': u'John',
u'PartOfSpeech': u'NNP'}],
[u'loves',
{u'CharacterOffsetBegin': u'5',
u'CharacterOffsetEnd': u'10',
u'Lemma': u'love',
u'PartOfSpeech': u'VBZ'}],
[u'Mary',
{u'CharacterOffsetBegin': u'11',
u'CharacterOffsetEnd': u'15',
u'Lemma': u'Mary',
u'PartOfSpeech': u'NNP'}],
[u'.',
{u'CharacterOffsetBegin': u'15',
u'CharacterOffsetEnd': u'16',
u'Lemma': u'.',
u'PartOfSpeech': u'.'}]]}]}
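To see the shape of that raw output, here is a minimal sketch (plain Python, no server needed) that takes a hand-copied subset of the response above and renders each dependency in the `rel(head-index, dep-index)` notation from the question; the list layout `[relation, head word, head index, dependent word, dependent index]` is an assumption read off the printed output:

```python
# Hand-copied subset of the server response shown above. Each dependency
# entry is [relation, head word, head index, dependent word, dependent
# index], with the indices as strings (assumption based on the output).
sentence = {
    'dependencies': [
        ['root', 'ROOT', '0', 'loves', '2'],
        ['nsubj', 'loves', '2', 'John', '1'],
        ['dobj', 'loves', '2', 'Mary', '3'],
        ['punct', 'loves', '2', '.', '4'],
    ],
}

def render(sentence):
    # Format each entry in the rel(head-index, dep-index) notation
    # used in the question's example output.
    for rel, head, head_idx, dep, dep_idx in sentence['dependencies']:
        yield '{}({}-{}, {}-{})'.format(rel, head, head_idx, dep, dep_idx)

lines = list(render(sentence))
print('\n'.join(lines))
# root(ROOT-0, loves-2)
# nsubj(loves-2, John-1)
# dobj(loves-2, Mary-3)
# punct(loves-2, .-4)
```

This is only a formatting exercise, but it confirms which positions in each five-element list hold the head and the dependent before you feed them into anything more elaborate.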
In case you want to use the dependency parser, you can reuse NLTK's DependencyGraph with a bit of effort:
>>> import jsonrpclib, json
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
... server.parse(
... 'John loves Mary. '
... 'I saw a man with a telescope. '
... 'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
... )
... )['sentences']
>>>
>>> def transform(sentence):
... for rel, _, head, word, n in sentence['dependencies']:
... n = int(n)
...
... word_info = sentence['words'][n - 1][1]
... tag = word_info['PartOfSpeech']
... lemma = word_info['Lemma']
... if rel == 'root':
... # NLTK expects that the root relation is labelled as ROOT!
... rel = 'ROOT'
...
... # Hack: Return values we don't know as '_'.
... # Also, consider tag and ctag to be equal.
... # n is used to sort words as they appear in the sentence.
... yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
... DependencyGraph(
... ' '.join(items) # NLTK expects an iterable of strings...
... for n, *items in sorted(transform(parse))
... )
... for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
(('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
(('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4)) # doctest: +NORMALIZE_WHITESPACE
Ballmer NNP 4 nsubj
has VBZ 4 aux
been VBN 4 cop
vocal JJ 0 ROOT
in IN 4 prep
the DT 8 det
past JJ 8 amod
warning NN 5 pobj
that WDT 13 dobj
Linux NNP 13 nsubj
is VBZ 13 cop
a DT 13 det
threat NN 8 rcmod
to TO 13 prep
Microsoft NNP 14 pobj
. . 4 punct
<BLANKLINE>
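The transform step itself needs nothing beyond plain Python, so it can be exercised without a running server. The sketch below mirrors the session's transform() on the 'John loves Mary.' response hand-copied from the first session, and builds the whitespace-separated rows that the DependencyGraph(...) call above consumes:

```python
def transform(sentence):
    # Mirror of the transform() in the session above: one row per token,
    # with '_' for the fields the wrapper does not provide. Note that the
    # second element of each dependency entry (the head *word*) is
    # discarded; the third element is the head *index*.
    for rel, _, head, word, n in sentence['dependencies']:
        n = int(n)
        word_info = sentence['words'][n - 1][1]
        tag = word_info['PartOfSpeech']
        lemma = word_info['Lemma']
        if rel == 'root':
            rel = 'ROOT'  # NLTK expects the root relation labelled ROOT
        yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'

# Hand-copied from the 'John loves Mary.' output of the first session.
sentence = {
    'dependencies': [
        ['root', 'ROOT', '0', 'loves', '2'],
        ['nsubj', 'loves', '2', 'John', '1'],
        ['dobj', 'loves', '2', 'Mary', '3'],
        ['punct', 'loves', '2', '.', '4'],
    ],
    'words': [
        ['John', {'Lemma': 'John', 'PartOfSpeech': 'NNP'}],
        ['loves', {'Lemma': 'love', 'PartOfSpeech': 'VBZ'}],
        ['Mary', {'Lemma': 'Mary', 'PartOfSpeech': 'NNP'}],
        ['.', {'Lemma': '.', 'PartOfSpeech': '.'}],
    ],
}

# Sort by token position, then drop the position to get the row fields.
rows = [' '.join(items) for n, *items in sorted(transform(sentence))]
print('\n'.join(rows))
# _ John John NNP NNP _ 2 nsubj _ _
# _ loves love VBZ VBZ _ 0 ROOT _ _
# _ Mary Mary NNP NNP _ 2 dobj _ _
# _ . . . . _ 2 punct _ _
```

Passing these rows (joined with newlines, or as an iterable of strings) to DependencyGraph is exactly what the list comprehension in the session does.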
Setting up CoreNLP is not that hard; check http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.