How to get a binary parse in Python


Question

I have data from natural language inference corpora (SNLI, MultiNLI) that comes in this form:

'( ( Two ( blond women ) ) ( ( are ( hugging ( one another ) ) ) . ) )'

They are supposed to be binary trees (some are not very clean).
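To see which corpus entries really are binary, the bracketed strings can be read into nested lists and checked. This is a minimal sketch, assuming tokens and parentheses are always separated by whitespace as in the example above. `parse_brackets` and `is_binary` are hypothetical helper names, not part of NLTK or the corpora's tooling.

```python
def parse_brackets(text):
    """Parse a whitespace-separated bracketing into nested lists of strings."""
    stack = [[]]  # stack[0] collects the finished root node(s)
    for tok in text.split():
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced root node")
    return stack[0][0]


def is_binary(node):
    """True if every internal node has exactly two children."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)


sentence = "( ( Two ( blond women ) ) ( ( are ( hugging ( one another ) ) ) . ) )"
tree = parse_brackets(sentence)
print(is_binary(tree))
```

Entries for which `is_binary` returns False are the "not very clean" ones and can be filtered out or repaired separately.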

I want to parse some of my own sentences into this format. How can I do that with NLTK or a similar tool?

I have found the StanfordParser, but I have not been able to work out how to get this kind of parse from it.

Recommended answer

Any tree can be converted to a binary tree that preserves its constituents. Here's a simple solution that works on nltk.Tree input:

from nltk import Tree
from functools import reduce

def binarize(tree):
    """
    Recursively turn a tree into a binary tree.
    """
    if isinstance(tree, str):
        return tree
    elif len(tree) == 1:
        return binarize(tree[0])
    else:
        label = tree.label()
        return reduce(lambda x, y: Tree(label, (binarize(x), binarize(y))), tree)

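If you want ordinary tuples instead of Tree, replace the last return statement with the line below. It binarizes each child first and only then folds the results into pairs, so the partially built tuples (which have no label() method) are never passed back into binarize:

return reduce(lambda x, y: (x, y), map(binarize, tree))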
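Once a tree is in nested-tuple form, printing it in the bracketing used in the question is a short recursion. The following is a minimal sketch, assuming every internal node is a 2-tuple and every leaf is a plain string. `to_snli_string` is a hypothetical helper name, not part of NLTK.

```python
def to_snli_string(node):
    """Render a nested 2-tuple/str structure as an SNLI-style bracketing."""
    if isinstance(node, str):
        return node
    left, right = node  # raises ValueError if the node is not binary
    return "( {} {} )".format(to_snli_string(left), to_snli_string(right))


# Hand-built tuple structure for the sentence from the question.
example = (
    ("Two", ("blond", "women")),
    (("are", ("hugging", ("one", "another"))), "."),
)
print(to_snli_string(example))
# ( ( Two ( blond women ) ) ( ( are ( hugging ( one another ) ) ) . ) )
```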

Example:

>>> t = Tree.fromstring('''(ROOT (S (NP (NNP Oracle))
    (VP (VBD had) (VP (VBN fought) (S (VP (TO to)
      (VP (VB keep) (NP (DT the) (NNS forms))
        (PP (IN from) (S (VP (VBG being) (VP (VBN released))))))))))))''')

>>> bt = binarize(t)

>>> print(t)
(ROOT
  (S
    (NP (NNP Oracle))
    (VP
      (VBD had)
      (VP
        (VBN fought)
        (S
          (VP
            (TO to)
            (VP
              (VB keep)
              (NP (DT the) (NNS forms))
              (PP (IN from) (S (VP (VBG being) (VP (VBN released))))))))))))
>>> print(bt)
(S
  Oracle
  (VP
    had
    (VP
      fought
      (VP
        to
        (VP (VP keep (NP the forms)) (PP from (VP being released)))))))

This guarantees a binary structure, but not necessarily the correct one. Because reduce folds from the left, any node with more than two children becomes left-branching, regardless of what the sentence means. Large-coverage parsers generate non-binary branching because some attachment choices are notoriously hard. (Consider the classic "I saw the girl with the telescope": is the PP "with the telescope" inside the object, or part of the VP?) So proceed with care.
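The effect of the folding direction on the telescope example can be sketched with plain tuples. This assumes a flat three-child VP as the parser output, and `fold_left` and `fold_right` are hypothetical helpers written for illustration. Neither result is claimed to be the correct analysis; the point is only that the choice is made mechanically.

```python
from functools import reduce


def fold_left(children):
    """Pair children from the left: [a, b, c] -> ((a, b), c)."""
    return reduce(lambda acc, child: (acc, child), children)


def fold_right(children):
    """Pair children from the right: [a, b, c] -> (a, (b, c))."""
    return reduce(lambda acc, child: (child, acc), reversed(children))


# Flat VP: verb, object NP, and PP as three sister nodes.
vp = ["saw", ("the", "girl"), ("with", ("the", "telescope"))]

# Left fold groups verb and object first; the PP then attaches to that pair.
print(fold_left(vp))
# Right fold groups the object with the PP, as if the PP modified "the girl".
print(fold_right(vp))
```

A left fold, as used by binarize, yields the first reading for every flat VP of this shape, whether or not it suits the sentence.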
