Stanford Universal Dependencies on Python NLTK


Problem Description

Is there any way I can get the Universal dependencies using python or nltk? I can only produce the parse tree.

Example:

Input sentence:

My dog also likes eating sausage.

Output:

Universal dependencies

nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)

Solution

Wordseer's stanford-corenlp-python fork is a good start, as it works with the recent CoreNLP release (3.5.2). However, it gives you raw output, which you need to transform manually. For example, assuming you have the wrapper running:

>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.')))  # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
                                   [u'nsubj',
                                    u'loves',
                                    u'2',
                                    u'John',
                                    u'1'],
                                   [u'dobj', u'loves', u'2', u'Mary', u'3'],
                                   [u'punct', u'loves', u'2', u'.', u'4']],
                 u'parsetree': [],
                 u'text': u'John loves Mary.',
                 u'words': [[u'John',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'John',
                              u'PartOfSpeech': u'NNP'}],
                            [u'loves',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'love',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'Mary',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'15',
                              u'Lemma': u'Mary',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'15',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'.',
                              u'PartOfSpeech': u'.'}]]}]}
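
As an aside (a sketch, not part of the original answer): the 'dependencies' entries in that JSON are laid out as [relation, governor, governor index, dependent, dependent index], so a few lines of post-processing can already print them in the rel(governor-index, dependent-index) style the question asks for:

>>> result = json.loads(server.parse('John loves Mary.'))
>>> for rel, gov, gov_i, dep, dep_i in result['sentences'][0]['dependencies']:
...     # Each entry: relation, governor word, governor index, dependent word, dependent index.
...     print('%s(%s-%s, %s-%s)' % (rel, gov, gov_i, dep, dep_i))
root(ROOT-0, loves-2)
nsubj(loves-2, John-1)
dobj(loves-2, Mary-3)
punct(loves-2, .-4)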

If you want to work with the dependency parses further, you can reuse NLTK's DependencyGraph with a bit of effort:

>>> import jsonrpclib, json
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
...    server.parse(
...       'John loves Mary. '
...       'I saw a man with a telescope. '
...       'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
...    )
... )['sentences']
>>>
>>> def transform(sentence):
...     for rel, _, head, word, n in sentence['dependencies']:
...         n = int(n)
...
...         word_info = sentence['words'][n - 1][1]
...         tag = word_info['PartOfSpeech']
...         lemma = word_info['Lemma']
...         if rel == 'root':
...             # NLTK expects that the root relation is labelled as ROOT!
...             rel = 'ROOT'
...
...         # Hack: Return values we don't know as '_'.
...         #       Also, consider tag and ctag to be equal.
...         # n is used to sort words as they appear in the sentence.
...         yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
...     DependencyGraph(
...         ' '.join(items)  # NLTK expects an iterable of strings...
...         for n, *items in sorted(transform(parse))
...     )
...     for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
 (('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
 (('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4))  # doctest: +NORMALIZE_WHITESPACE
Ballmer     NNP     4       nsubj
has         VBZ     4       aux
been        VBN     4       cop
vocal       JJ      0       ROOT
in          IN      4       prep
the         DT      8       det
past        JJ      8       amod
warning     NN      5       pobj
that        WDT     13      dobj
Linux       NNP     13      nsubj
is          VBZ     13      cop
a           DT      13      det
threat      NN      8       rcmod
to          TO      13      prep
Microsoft   NNP     14      pobj
.           .       4       punct
<BLANKLINE>
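
One more sketch (again, not part of the original answer): to get the rel(governor-index, dependent-index) format from the question out of a DependencyGraph, you can walk its nodes dictionary. This assumes the node layout of recent NLTK releases ('word', 'head', 'rel' keys, with address 0 being the artificial root); exact labels such as ROOT and the ordering may vary with the NLTK version.

>>> dg = dgs[0]
>>> for address in sorted(dg.nodes):
...     node = dg.nodes[address]
...     if node['word'] is None:
...         # Skip the artificial root node at address 0.
...         continue
...     governor = dg.nodes[node['head']]
...     print('%s(%s-%s, %s-%s)' % (node['rel'], governor['word'] or 'ROOT',
...                                 node['head'], node['word'], address))
nsubj(loves-2, John-1)
ROOT(ROOT-0, loves-2)
dobj(loves-2, Mary-3)
punct(loves-2, .-4)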

Setting up CoreNLP is not that hard; see http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.
