How to identify and remove trace trees from nltk.trees?

Question

For example, I have this small tree (obviously only a subtree):

(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))

Trace trees are trees that lead to a leaf of the shape *.* (for example *0* or *T*). I want to remove every subtree that is a trace tree, so for this example the result should look like this:

(VP (VBZ says))

So far I have extracted all of those leaves:

from nltk.tree import ParentedTree
import re

traceLeaves = []    

line = "( (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))))"
currTree = ParentedTree.fromstring(line, remove_empty_top_bracketing=True)
for leaf in currTree.leaves():
    if re.search(r'\*', leaf):  # raw string, so \* is a literal asterisk
        traceLeaves.append(leaf)

But I have no idea how to navigate up the tree until I reach a node with a sibling that is not a trace tree, and then remove the trace tree from the original tree. I'm completely stuck here, since I have only just started working with nltk.

Here is one complete sentence I want to be able to process:

( (SINV (S-3 (S (NP-SBJ-1 (-NONE- *PRO*)) (VP (VBG Assuming) (NP (DT that) (NN post)) (PP-TMP (IN at) (NP (NP (DT the) (NN age)) (PP (IN of) (NP (CD 35))))))) (, ,) (NP-SBJ-1 (PRP he)) (VP (VBD managed) (PP-MNR (IN by) (NP (NN consensus))) (, ,) (SBAR-ADV (IN as) (S (NP-SBJ (-NONE- *PRO*)) (VP (VBZ is) (NP-PRD (DT the) (NN rule)) (PP-LOC (IN in) (NP (NNS universities)))))))) (, ,) (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))) (NP-SBJ (NP (NNP Warren) (NNP H.) (NNP Strother)) (, ,) (NP (NP (DT a) (NN university) (NN official)) (SBAR (WHNP-2 (WP who)) (S (NP-SBJ-2 (-NONE- *T*)) (VP (VBZ is) (VP (VBG researching) (NP (NP (DT a) (NN book)) (PP (IN on) (NP (NNP Mr.) (NNP Hahn)))))))))) (. .)) )

Recommended answer

Leaves are regular strings, so they're no help for navigating the tree. Scan the tree and look for subtrees of height 2 instead.
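As a rough illustration of that first step, the sketch below lists the height-2 subtrees of the small example from the question. The variable names are made up for this sketch, and the trace test (leaf starts with "*") is the same abbreviated assumption used in the answer's code.

```python
from nltk.tree import ParentedTree

tree = ParentedTree.fromstring(
    "(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))"
)

# Leaves are plain strings, so they have no parent() or treeposition().
leaf_types = {type(leaf).__name__ for leaf in tree.leaves()}

# A subtree of height 2 is a tag sitting directly above a leaf.
preterminals = [sub for sub in tree.subtrees() if sub.height() == 2]

# Assumption: a trace is a height-2 subtree whose leaf starts with "*".
traces = [sub for sub in preterminals if sub[0].startswith("*")]
trace_positions = [sub.treeposition() for sub in traces]

print(leaf_types)
print(trace_positions)
```

Unlike the leaves, each of these subtrees is a ParentedTree, so it knows its parent and its position in the whole tree.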

To recognize what should be deleted, note that an nltk tree is a kind of list; so to see how many children a node has, just take its len(). When you find a trace leaf, move up the tree as long as the parent has only one child; then delete the subtree you ended up at. This should cover everything: if a node dominates two trace trees and nothing else, it will contain only one after you delete the first, so the climb from the second trace will pass through it :-)
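The climb described above could be sketched as follows. The helper name highest_removable is hypothetical (it is not part of nltk); only parent(), len() and tuple indexing come from the library.

```python
from nltk.tree import ParentedTree

def highest_removable(sub):
    """Hypothetical helper: climb from `sub` while it is an only child."""
    parent = sub.parent()
    while parent is not None and len(parent) == 1:
        sub = parent
        parent = sub.parent()
    return sub

tree = ParentedTree.fromstring(
    "(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))"
)

# (-NONE- *T*) is the only child of S-3, so the climb stops at S-3;
# SBAR has two children, so it is not reached from here.
top = highest_removable(tree[1, 1, 0])
print(top.label())

# (-NONE- *0*) has a sibling (S-3), so nothing above it is removable yet.
same = highest_removable(tree[1, 0])
print(same.label())
```

Once S-3 has been deleted, SBAR is left with a single child, which is why the climb from the *0* trace then reaches SBAR.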

The following has one more trick: Deleting a node confuses the for-loop, since the list of branches gets shorter. To prevent things from moving before we delete them, we scan the tree backwards.

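# t is the ParentedTree to prune; it is modified in place
for sub in reversed(list(t.subtrees())):
    if sub.height() == 2 and sub[0].startswith("*"):  # abbreviated test
        parent = sub.parent()
        # climb while the current subtree is its parent's only child
        while parent is not None and len(parent) == 1:
            sub = parent
            parent = sub.parent()
        print(sub, "will be deleted")
        del t[sub.treeposition()]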
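For reuse, the loop can be wrapped in a function. This is only a sketch under a few assumptions: remove_traces is a made-up name, the tree passed in is the root of its own ParentedTree (its parent() is None), and a tree that consists of nothing but a trace is left untouched, because nltk does not allow deleting the root position ().

```python
from nltk.tree import ParentedTree

def remove_traces(tree):
    """Hypothetical helper: delete trace subtrees from `tree` in place."""
    # Snapshot the subtrees before mutating, then visit them in reverse.
    for sub in reversed(list(tree.subtrees())):
        # Assumption: a trace is a height-2 subtree whose leaf starts with "*".
        if sub.height() != 2 or not sub[0].startswith("*"):
            continue
        parent = sub.parent()
        while parent is not None and len(parent) == 1:
            sub = parent
            parent = sub.parent()
        if parent is None:
            # The climb reached the root: the whole tree is a trace, skip it.
            continue
        del tree[sub.treeposition()]
    return tree

small = ParentedTree.fromstring(
    "(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))"
)
remove_traces(small)
print(small)  # (VP (VBZ says))
```

The print call from the answer's loop is dropped here so that the function stays silent; add it back if you want to see what gets deleted.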
