Clause Extraction using Stanford parser


Question


    I have a complex sentence and I need to separate it into its main and dependent clauses. For example, for the sentence
    ABC cites the fact that chemical additives are banned in many countries and feels they may be banned in this state too.
    the required split is

    1) ABC cites the fact
    2) chemical additives are banned in many countries
    3) ABC feels they may be banned in this state too.
    

    I think I could use the Stanford Parser tree or dependencies, but I am not sure how to proceed from here.

    The tree

    (ROOT
      (S
        (NP (NNP ABC))
        (VP (VBZ cites)
          (NP (DT the) (NN fact))
          (SBAR (IN that)
            (S
              (NP (NN chemical) (NNS additives))
              (VP
                (VP (VBP are)
                  (VP (VBN banned)
                    (PP (IN in)
                      (NP (JJ many) (NNS countries)))))
                (CC and)
                (VP (VBZ feels)
                  (SBAR
                    (S
                      (NP (PRP they))
                      (VP (MD may)
                        (VP (VB be)
                          (VP (VBN banned)
                            (PP (IN in)
                              (NP (DT this) (NN state)))
                            (ADVP (RB too))))))))))))
        (. .)))
    

    and the collapsed dependency parse

    nsubj(cites-2, ABC-1)  
    root(ROOT-0, cites-2)  
    det(fact-4, the-3)   
    dobj(cites-2, fact-4)  
    mark(banned-9, that-5)  
    nn(additives-7, chemical-6)  
    nsubjpass(banned-9, additives-7)   
    nsubj(feels-14, additives-7)   
    auxpass(banned-9, are-8)   
    ccomp(cites-2, banned-9)   
    amod(countries-12, many-11)  
    prep_in(banned-9, countries-12)   
    ccomp(cites-2, feels-14)    
    conj_and(banned-9, feels-14)    
    nsubjpass(banned-18, they-15)   
    aux(banned-18, may-16)    
    auxpass(banned-18, be-17)    
    ccomp(feels-14, banned-18)   
    det(state-21, this-20)    
    prep_in(banned-18, state-21)    
    advmod(banned-18, too-22)   
    

    Solution

    It is probably better to work primarily from the constituency-based parse tree rather than the dependencies. The dependencies will be helpful, but only after the main work is done. I explain this towards the end of my answer.

    This is because a constituency parse is based on phrase structure grammar, which is the most relevant representation if you want to extract clauses from a sentence. It can be done using dependencies as well, but in that case you will essentially be reconstructing the phrase structure -- starting from the root and looking at dependent nodes (e.g. ABC and fact are the nominal subject and direct object of the verb cites, and so on).

    It is helpful to visualize the parse tree, however. In your example, the clauses are indicated by the SBAR tag, which marks a clause introduced by a (possibly empty) subordinating conjunction. All you need to do is the following:

    1. Identify the non-root clausal nodes in the parse tree
    2. Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
    3. In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.

    In step 3, "hanging" refers to any preposition, subordinating conjunction or adverb whose dependent was removed in step 2. E.g., from "ABC cites the fact that", you need to remove the preposition/subordinating conjunction "that" because its dependent node "banned" was removed in step 2. You will thus have three independent clauses:

    • chemical additives are banned in many countries (SBAR removal in step 2)
    • they may be banned in this state too (SBAR removal in step 2)
    • ABC cites the fact (step 3)
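
The three steps can be sketched in plain Python, working directly on the bracketed tree from the question. This is a minimal sketch under assumptions, not a definitive implementation: the helper names (`parse_tree`, `detach`, `split_clauses`) are hypothetical, only SBAR is treated as a clausal tag, and "hanging" is approximated as an IN or RB word sitting directly under a detached SBAR.

```python
import re

CLAUSE_LABELS = {"SBAR"}     # assumption: split on SBAR only
HANGING_TAGS = {"IN", "RB"}  # assumption: tags dropped when left directly under an SBAR
PUNCT_TAGS = {".", ",", ":"}

TREE = """
(ROOT (S (NP (NNP ABC))
  (VP (VBZ cites) (NP (DT the) (NN fact))
    (SBAR (IN that)
      (S (NP (NN chemical) (NNS additives))
        (VP (VP (VBP are) (VP (VBN banned) (PP (IN in) (NP (JJ many) (NNS countries)))))
          (CC and)
          (VP (VBZ feels)
            (SBAR (S (NP (PRP they))
              (VP (MD may) (VP (VB be) (VP (VBN banned)
                (PP (IN in) (NP (DT this) (NN state))) (ADVP (RB too))))))))))))
  (. .)))
"""

def parse_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]; words are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()

def is_preterminal(node):
    """True for a (TAG word) node."""
    return len(node) == 2 and isinstance(node[1], str)

def detach(node):
    """Return (copy of node without nested clause subtrees, list of removed subtrees)."""
    kept, removed = [node[0]], []
    for child in node[1:]:
        if isinstance(child, str):
            kept.append(child)
        elif child[0] in CLAUSE_LABELS:
            removed.append(child)     # steps 1-2: set the clausal subtree aside
        elif (node[0] in CLAUSE_LABELS and is_preterminal(child)
              and child[0] in HANGING_TAGS):
            continue                  # step 3: drop a hanging word such as "that"
        else:
            sub, sub_removed = detach(child)
            kept.append(sub)
            removed.extend(sub_removed)
    return kept, removed

def words(node):
    """Collect the words under a node, skipping punctuation."""
    if is_preterminal(node):
        return [] if node[0] in PUNCT_TAGS else [node[1]]
    return [w for child in node[1:] for w in words(child)]

def split_clauses(tree):
    """Split the main tree first, then each detached clause in turn."""
    pending, clauses = [tree], []
    while pending:
        main, removed = detach(pending.pop(0))
        clauses.append(" ".join(words(main)))
        pending.extend(removed)
    return clauses

for clause in split_clauses(parse_tree(TREE)):
    print(clause)
```

On this particular tree the sketch prints "ABC cites the fact", "chemical additives are banned in many countries and feels" and "they may be banned in this state too". The trailing "and feels" is left over because the parser attached "feels" inside the first SBAR.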

    The only issue here is the connection ABC--feels. For this, note that both "banned" and "feels" are complements of the verb "cites", and hence have the same subject, which is "ABC". Resolving this gives you a fourth clause, "ABC feels", which you may or may not want to include in your final result.
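
One way to recover the ABC--feels link from the collapsed dependencies is sketched below. It is only a heuristic under assumptions: the helper names `parse_deps` and `subject_via_complement` are hypothetical, and the rule (a verb that is a `ccomp` of another verb borrows that verb's `nsubj`) follows the reasoning above rather than the parser's own `nsubj(feels-14, additives-7)` edge.

```python
import re

# A subset of the collapsed dependencies from the question.
DEPS = """
nsubj(cites-2, ABC-1)
root(ROOT-0, cites-2)
dobj(cites-2, fact-4)
nsubjpass(banned-9, additives-7)
nsubj(feels-14, additives-7)
ccomp(cites-2, banned-9)
ccomp(cites-2, feels-14)
conj_and(banned-9, feels-14)
ccomp(feels-14, banned-18)
"""

def parse_deps(text):
    """Return a list of (relation, governor, dependent) triples."""
    pattern = re.compile(r"(\w+)\(([^\s,]+), ([^\s)]+)\)")
    return [match.groups() for match in pattern.finditer(text)]

def subject_via_complement(deps, verb):
    """If `verb` is a ccomp of some governor, return that governor's nsubj, else None."""
    for rel, gov, dep in deps:
        if rel == "ccomp" and dep == verb:
            for rel2, gov2, dep2 in deps:
                if rel2 == "nsubj" and gov2 == gov:
                    return dep2
    return None

deps = parse_deps(DEPS)
subject = subject_via_complement(deps, "feels-14")
# Strip the token index ("ABC-1" -> "ABC") before printing the clause.
print(subject.rsplit("-", 1)[0], "feels")
```

The same lookup applied to `banned-9` would also return `ABC-1`, even though that verb already has its own passive subject, so in practice you would probably apply the heuristic only to verbs without a reliable subject of their own.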

    For a list of all clausal tags (and, in fact, all Penn Treebank tags), see this list: http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html

    For an online parse-tree visualization, you may want to use the online Berkeley parser demo. It helps a lot in forming a better intuition.

    Caveats

    1. Even the best parsers will not always parse sentences correctly, so keep that in mind.
    2. Additionally, many complex sentences involve right node raising, which is almost never parsed correctly by most parsers.
    3. You may need to modify the algorithm a little if a clause is in passive voice.

    Apart from these three pitfalls, the above algorithm should work quite accurately.
