How to get a binary parse in Python
Question
I have data from natural language inference corpora (SNLI, MultiNLI) that comes in this form:
'( ( Two ( blond women ) ) ( ( are ( hugging ( one another ) ) ) . ) )'
They are supposed to be binary trees (some are not very clean).
I want to parse some of my own sentences into this format. How can I do that with NLTK or a similar tool?
I have found the StanfordParser, but I have not been able to work out how to get this kind of parse from it.
Answer
Any tree can be converted to a binary tree that preserves its constituents. Here's a simple solution that works on nltk.Tree input:
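    from functools import reduce

    from nltk import Tree


    def binarize(tree):
        """
        Recursively turn a tree into a binary tree.
        """
        if isinstance(tree, str):
            # A leaf (a word) is returned unchanged.
            return tree
        elif len(tree) == 1:
            # Collapse unary nodes, e.g. (NP (NNP Oracle)) becomes Oracle.
            return binarize(tree[0])
        else:
            label = tree.label()
            # Binarize each child once, then fold left to right into pairs
            # that reuse the parent's label.
            return reduce(lambda x, y: Tree(label, [x, y]), map(binarize, tree))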
If you want ordinary tuples instead of Tree, replace the last return statement with this:
    return reduce(lambda x, y: (x, y), map(binarize, tree))
The children are binarized before folding, so binarize is never called on a tuple, which has no label() method.
Example:
    >>> t = Tree.fromstring('''(ROOT (S (NP (NNP Oracle))
        (VP (VBD had) (VP (VBN fought) (S (VP (TO to)
        (VP (VB keep) (NP (DT the) (NNS forms))
        (PP (IN from) (S (VP (VBG being) (VP (VBN released))))))))))))''')
    >>> bt = binarize(t)
    >>> print(t)
    (ROOT
      (S
        (NP (NNP Oracle))
        (VP
          (VBD had)
          (VP
            (VBN fought)
            (S
              (VP
                (TO to)
                (VP
                  (VB keep)
                  (NP (DT the) (NNS forms))
                  (PP (IN from) (S (VP (VBG being) (VP (VBN released))))))))))))
    >>> print(bt)
    (S
      Oracle
      (VP
        had
        (VP
          fought
          (VP
            to
            (VP (VP keep (NP the forms)) (PP from (VP being released)))))))
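To get from a binarized tree to the bracketed string format shown in the question, a small recursive formatter may be enough. The following is a minimal sketch, assuming the parse is held as nested tuples of strings (the tuple variant described above); the name to_snli_string is hypothetical and not part of NLTK.

```python
def to_snli_string(node):
    """Render nested tuples of tokens in the SNLI-style bracket format."""
    if isinstance(node, str):
        # A leaf is just the token itself.
        return node
    # An internal node wraps its space-separated children in "( ... )".
    return "( " + " ".join(to_snli_string(child) for child in node) + " )"


# Hand-written nested tuples mirroring the sentence from the question.
parse = (
    ("Two", ("blond", "women")),
    (("are", ("hugging", ("one", "another"))), "."),
)
print(to_snli_string(parse))
# prints: ( ( Two ( blond women ) ) ( ( are ( hugging ( one another ) ) ) . ) )
```

Because the formatter only relies on iteration and a string check for leaves, it should also accept an nltk.Tree, in which case the node labels are simply dropped.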
Binarizing this way guarantees a binary structure, but not necessarily the correct one. Wide-coverage parsers produce non-binary branching because some attachment choices are notoriously hard. (Consider the classic "I saw the girl with the telescope": is the PP "with the telescope" inside the object NP, or part of the VP?) The left-to-right fold simply picks one grouping for every node with more than two children, so proceed with care.
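The effect of the left-to-right fold on the telescope sentence can be seen with plain tuples. This is an illustrative sketch only: the flat verb phrase below is hand-written rather than parser output, and left_fold is a hypothetical helper that mimics the reduce step in binarize.

```python
from functools import reduce


def left_fold(children):
    """Pair up children left to right, as the reduce step in binarize does."""
    return reduce(lambda left, right: (left, right), children)


# A flat, three-way verb phrase: verb, object NP, PP.
flat_vp = ["saw", ("the", "girl"), ("with", ("the", "telescope"))]

# The fold groups verb and object first, so the PP ends up attached to the VP.
vp_attachment = left_fold(flat_vp)

# The other reading, with the PP inside the object NP, has to be built by hand.
np_attachment = ("saw", (("the", "girl"), ("with", ("the", "telescope"))))

print(vp_attachment)
print(np_attachment)
```

Both results are valid binary trees over the same words, which is why a mechanical binarization cannot tell which one the sentence actually means.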