Clause Extraction using Stanford parser

Question

I have a complex sentence and I need to separate it into its main and dependent clauses. For example, for the sentence
ABC cites the fact that chemical additives are banned in many countries and feels they may be banned in this state too.
the required split is:

1) ABC cites the fact
2) chemical additives are banned in many countries
3) ABC feels they may be banned in this state too.

I think I could use the Stanford Parser tree or dependencies, but I am not sure how to proceed from here.


(ROOT
  (S
    (NP (NNP ABC))
    (VP (VBZ cites)
      (NP (DT the) (NN fact))
      (SBAR (IN that)
        (S
          (NP (NN chemical) (NNS additives))
          (VP
            (VP (VBP are)
              (VP (VBN banned)
                (PP (IN in)
                  (NP (JJ many) (NNS countries)))))
            (CC and)
            (VP (VBZ feels)
              (SBAR
                (S
                  (NP (PRP they))
                  (VP (MD may)
                    (VP (VB be)
                      (VP (VBN banned)
                        (PP (IN in)
                          (NP (DT this) (NN state)))
                        (ADVP (RB too))))))))))))
    (. .)))

and the collapsed dependency parse:


nsubj(cites-2, ABC-1)  
root(ROOT-0, cites-2)  
det(fact-4, the-3)   
dobj(cites-2, fact-4)  
mark(banned-9, that-5)  
nn(additives-7, chemical-6)  
nsubjpass(banned-9, additives-7)   
nsubj(feels-14, additives-7)   
auxpass(banned-9, are-8)   
ccomp(cites-2, banned-9)   
amod(countries-12, many-11)  
prep_in(banned-9, countries-12)   
ccomp(cites-2, feels-14)    
conj_and(banned-9, feels-14)    
nsubjpass(banned-18, they-15)   
aux(banned-18, may-16)    
auxpass(banned-18, be-17)    
ccomp(feels-14, banned-18)   
det(state-21, this-20)    
prep_in(banned-18, state-21)    
advmod(banned-18, too-22)   

Recommended answer

It is probably better to work primarily with the constituency-based parse tree rather than the dependencies. The dependencies will be helpful, but only after the main work is done. I explain this towards the end of my answer.

This is because a constituency parse is based on phrase structure grammar, which is the most relevant representation if you want to extract clauses from a sentence. It can be done using dependencies as well, but in that case you will essentially be reconstructing the phrase structure: starting from the root and looking at its dependent nodes (e.g. "ABC" and "fact" are the nominal subject and direct object of the verb "cites", and so on).

It is helpful to visualize the parse tree, however. In your example, the clauses are indicated by the SBAR tag, which is a clause introduced by a (possibly empty) subordinating conjunction. All you need to do is the following:

  1. Identify the non-root clausal nodes in the parse tree
  2. Remove (but retain separately) the subtrees rooted at these clausal nodes from the main tree.
  3. In the main tree (after removal of subtrees in step 2), remove any hanging prepositions, subordinating conjunctions and adverbs.
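
Steps 1 and 2 can be sketched in plain Python without any NLP library. This is only an illustration under a few assumptions: the parse is available as a Penn Treebank bracketed string, SBAR is the only clausal label being detached, and the helper names (`parse_tree`, `leaves`, `detach_clauses`) are made up for this example and are not part of the Stanford Parser API.

```python
import re

# The question's parse tree, written on fewer lines.
TREE = (
    "(ROOT (S (NP (NNP ABC)) (VP (VBZ cites) (NP (DT the) (NN fact)) "
    "(SBAR (IN that) (S (NP (NN chemical) (NNS additives)) (VP "
    "(VP (VBP are) (VP (VBN banned) (PP (IN in) (NP (JJ many) (NNS countries))))) "
    "(CC and) (VP (VBZ feels) (SBAR (S (NP (PRP they)) (VP (MD may) (VP (VB be) "
    "(VP (VBN banned) (PP (IN in) (NP (DT this) (NN state))) (ADVP (RB too))"
    "))))))))))"
    " (. .)))"
)


def parse_tree(text):
    """Parse a bracketed string into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok              # a leaf word
        node = [tokens[pos]]        # the constituent label
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # consume the closing ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def detach_clauses(node, label="SBAR", found=None):
    """Cut out every subtree carrying `label` (innermost first) and return them."""
    if found is None:
        found = []
    kept = [node[0]]
    for child in node[1:]:
        if isinstance(child, str):
            kept.append(child)
            continue
        detach_clauses(child, label, found)   # handle nested clauses first
        if child[0] == label:
            found.append(child)               # step 2: remove but retain
        else:
            kept.append(child)
    node[:] = kept                            # prune the tree in place
    return found


tree = parse_tree(TREE)
clauses = detach_clauses(tree)
print(" ".join(leaves(tree)))
for clause in clauses:
    print(" ".join(leaves(clause)))
```

On the question's tree, the pruned main tree reads `ABC cites the fact .`, and the two detached pieces read `they may be banned in this state too` and `that chemical additives are banned in many countries and feels`. The second piece still carries `that` and a stranded `and feels` because the parser attached `feels` inside the SBAR, so some cleanup of the detached pieces is still needed.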

In step 3, "hanging" means any preposition, subordinating conjunction, or adverb whose dependent was removed in step 2. For example, in "ABC cites the fact that", you need to remove the preposition/subordinating conjunction "that" because its dependent node "banned" was removed in step 2. You will thus have three independent clauses:

  • chemical additives are banned in many countries (SBAR removed in step 2)
  • they may be banned in this state too (SBAR removed in step 2)
  • ABC cites the fact (step 3)
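
The step 3 cleanup can be approximated on a flat list of (word, tag) pairs. The sketch below assumes that a "hanging" word is any function word left at the right edge of a clause, and the tag set in `HANGING_TAGS` and the function name `strip_hanging` are choices made for this illustration only. Adverbs are deliberately left out of the set, since a trailing adverb such as "too" usually still belongs to its clause.

```python
# Penn Treebank tags treated as "hanging" when they end a clause (an assumption).
HANGING_TAGS = {"IN", "CC", "TO", "WDT", "WRB"}


def strip_hanging(tagged):
    """Drop function words stranded at the end of a clause."""
    end = len(tagged)
    while end > 0 and tagged[end - 1][1] in HANGING_TAGS:
        end -= 1
    return tagged[:end]


main_clause = [
    ("ABC", "NNP"), ("cites", "VBZ"), ("the", "DT"),
    ("fact", "NN"), ("that", "IN"),
]
cleaned = strip_hanging(main_clause)
print(" ".join(word for word, _ in cleaned))
```

A more faithful version would consult the dependency parse and drop a word only when its dependent really was removed, rather than relying on its position in the clause.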

The only remaining issue is the connection between "ABC" and "feels". For this, note that both "banned" and "feels" are complements of the verb "cites" and hence share its subject, which is "ABC". Once that link is restored, you get a fourth clause, "ABC feels", which you may or may not want to include in your final result.
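
This is where the dependencies become useful. The following sketch reads a subset of the collapsed dependencies from the question and applies the reasoning above as a heuristic: a verb that is a clausal complement (`ccomp`) of another verb borrows that verb's `nsubj`. The function names are hypothetical, and the heuristic intentionally overrides the parser's own `nsubj(feels-14, additives-7)` relation, so treat it as an assumption rather than a general rule.

```python
import re

# A subset of the collapsed dependencies listed in the question.
DEPS = """\
nsubj(cites-2, ABC-1)
root(ROOT-0, cites-2)
dobj(cites-2, fact-4)
nsubjpass(banned-9, additives-7)
nsubj(feels-14, additives-7)
ccomp(cites-2, banned-9)
ccomp(cites-2, feels-14)
conj_and(banned-9, feels-14)
nsubjpass(banned-18, they-15)
ccomp(feels-14, banned-18)
"""


def parse_deps(text):
    """Return (relation, governor, dependent) triples from 'rel(gov-i, dep-j)' lines."""
    pattern = re.compile(r"(\w+)\((\S+), (\S+)\)")
    matches = (pattern.match(line) for line in text.splitlines())
    return [m.groups() for m in matches if m]


def borrowed_subject(verb, triples):
    """Return the nsubj of the verb that `verb` complements via ccomp, or None."""
    for rel, gov, dep in triples:
        if rel == "ccomp" and dep == verb:
            for rel2, gov2, dep2 in triples:
                if rel2 == "nsubj" and gov2 == gov:
                    return dep2
    return None


triples = parse_deps(DEPS)
print(borrowed_subject("feels-14", triples))
```

For `feels-14` the lookup goes through `ccomp(cites-2, feels-14)` to `nsubj(cites-2, ABC-1)` and returns `ABC-1`, which gives the "ABC feels" clause.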

For a list of all clausal tags (and, in fact, all Penn Treebank tags), see this list: http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html

For an online parse-tree visualization, you may want to use the online Berkeley parser demo. It helps a lot in forming a better intuition. Here's the image generated for your example sentence:

Caveats

  1. Even the best parsers will not always parse sentences correctly, so keep that in mind.
  2. Additionally, many complex sentences involve right node raising, which most parsers almost never handle correctly.
  3. You may need to modify the algorithm a little if a clause is in passive voice.

Apart from these three pitfalls, the above algorithm should work quite accurately.
