Sentence structure identification - spaCy


Question

I want to identify sentence structure in English using spaCy and textacy.

For example: "The cat sat on the mat." is SVO, "The cat jumped and picked up the biscuit." is SVVO, and "The cat ate the biscuit and cookies." is SVOO.

The program is supposed to read a paragraph and return, for each sentence, an output such as SVO, SVOO, SVVO or another custom structure.

My attempt so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"] 
VERB = ["ROOT"] 
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)

Output:

(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])

  • Issue 1: The SVO triples are overwritten. Why?
  • Issue 2: How can I identify each sentence as SVOO, SVO, SVVO, etc.?

      An approach I was conceptualizing:

      from __future__ import unicode_literals
      import en_core_web_sm
      nlp = en_core_web_sm.load()
      sentence = 'I will go to the mall.'
      doc = nlp(sentence)
      # Fine-grained POS tags that must all occur for the sentence to count as SVO
      chk_set = {'PRP', 'MD', 'NN'}
      result = chk_set.issubset(t.tag_ for t in doc)
      if result:
          print("SVO")
      else:
          print("SVO not identified")
      

      Further progress:

      from __future__ import unicode_literals
      import spacy,en_core_web_sm
      import textacy
      nlp = en_core_web_sm.load()
      sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
      doc = nlp(sentence)
      print(" ".join([token.dep_ for token in doc]))
      

      Current output:

      det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct


      Expected output:

      SVO SVVO SVOO
      

      The idea is to break the dependency tags down into a simple subject-verb-object model.

      I am thinking of achieving this with regex if no other options are available, but that is my last resort.
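
A minimal sketch of that idea, working only on the flat sequence of dependency labels printed above. The mapping table, the function name `patterns_from_deps`, the use of `punct` as a sentence boundary, and the rule that a `conj` label repeats the most recent role are all assumptions made for illustration; none of them comes from spaCy or textacy.

```python
# Assumed mapping from dependency labels to S/V/O letters (illustrative only).
DEP_TO_ROLE = {
    "nsubj": "S",
    "nsubjpass": "S",
    "ROOT": "V",
    "dobj": "O",
    "pobj": "O",
}


def patterns_from_deps(dep_string):
    """Turn a space-separated run of dependency labels into S/V/O patterns."""
    patterns = []
    current = ""
    for dep in dep_string.split():
        if dep == "punct":
            # Assumption: every punct label closes a sentence.
            if current:
                patterns.append(current)
            current = ""
        elif dep == "conj":
            # Assumption: a conjunct repeats the most recent role, because a
            # flat label sequence does not say which word it attaches to.
            current += current[-1:]
        else:
            current += DEP_TO_ROLE.get(dep, "")
    if current:
        patterns.append(current)
    return patterns


DEPS = ("det nsubj ROOT prep det pobj punct "
        "det nsubj ROOT cc conj prt det dobj punct "
        "det nsubj ROOT dobj cc conj punct")

print(" ".join(patterns_from_deps(DEPS)))  # prints: SVO SVVO SVOO
```

This works for the three sample sentences, but it is fragile: a comma is also labeled `punct`, and the "repeat the last role" rule is only a guess at what the conjunct modifies. Reading the head of each token from the parse removes that guesswork.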

      Edit 3:

      After studying this link, I made some progress.

      def testSVOs():
          nlp = en_core_web_sm.load()
          tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
          svos = findSVOs(tok)
          print(svos)
      

      Current output:

      [(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]
      


      Expected output:

      I am expecting a notation for each sentence. I am able to extract the SVO triples, but I do not know how to convert them into SVO notation. This is more about pattern identification than about the sentence content itself.

      SVO SVO SVOO
      

      Recommended answer

      Issue 1: The SVO triples are overwritten. Why?

      This is a textacy issue. This part does not work very well; see this blog.

      Issue 2: How can I identify each sentence as SVOO, SVO, SVVO, etc.?

      You should parse the dependency tree. spaCy provides the information; you just need to write a set of rules to extract it, using the .head, .lefts, .rights and .children attributes.

      >>> for word in text:
      ...     print('%10s %5s %10s %10s %s' % (word.text, word.tag_, word.dep_, word.pos_, word.head.text))
      
              The    DT        det        DET cat 
              cat    NN      nsubj       NOUN sat 
              sat   VBD       ROOT       VERB sat 
               on    IN       prep        ADP sat 
              the    DT        det        DET mat
              mat    NN       pobj       NOUN on 
                .     .      punct      PUNCT sat 
               of    IN       ROOT        ADP of 
              the    DT        det        DET lab
              art    NN   compound       NOUN lab
              lab    NN       pobj       NOUN of 
                .     .      punct      PUNCT of 
              The    DT        det        DET cat 
              cat    NN      nsubj       NOUN jumped 
           jumped   VBD       ROOT       VERB jumped 
              and    CC         cc      CCONJ jumped 
           picked   VBD       conj       VERB jumped 
               up    RP        prt       PART picked 
              the    DT        det        DET biscuit
          biscuit    NN       dobj       NOUN picked 
                .     .      punct      PUNCT jumped 
              The    DT        det        DET cat 
              cat    NN      nsubj       NOUN ate 
              ate   VBD       ROOT       VERB ate 
          biscuit    NN       dobj       NOUN ate 
              and    CC         cc      CCONJ biscuit 
          cookies   NNS       conj       NOUN biscuit 
                .     .      punct      PUNCT ate 
      

      I recommend you look at this code: just add pobj to the list of OBJECTS, and you will get SVO and SVOO covered. With a little fiddling you can get SVVO as well.
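
The rule-based idea can be sketched roughly as follows. This is not the linked code: `role`, `sentence_pattern`, `StubToken` and `build_sentence` are hypothetical names introduced here, and treating every ROOT as a verb is a simplifying assumption. To keep the example runnable without a model download, the tokens are stand-in objects filled with the labels and heads from the parse table above; they expose only `.text`, `.dep_` and `.head`, which mirror the spaCy token attributes of the same names.

```python
SUBJECT_DEPS = {"nsubj", "nsubjpass"}
OBJECT_DEPS = {"dobj", "pobj"}  # pobj added, as suggested above


def role(token):
    """Map one token to 'S', 'V', 'O' or '' using its dependency label."""
    if token.dep_ in SUBJECT_DEPS:
        return "S"
    if token.dep_ == "ROOT":
        # Simplifying assumption: the root of the sentence is its main verb.
        return "V"
    if token.dep_ in OBJECT_DEPS:
        return "O"
    if token.dep_ == "conj":
        # A conjunct takes the role of the word it is attached to.
        return role(token.head)
    return ""


def sentence_pattern(tokens):
    """Concatenate the roles of all tokens of one sentence, in order."""
    return "".join(role(tok) for tok in tokens)


class StubToken:
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""

    def __init__(self, text, dep):
        self.text = text
        self.dep_ = dep
        self.head = self  # a root token is its own head


def build_sentence(rows):
    """Build linked stub tokens from (text, dep, head_index) rows."""
    tokens = [StubToken(text, dep) for text, dep, _ in rows]
    for tok, (_, _, head_index) in zip(tokens, rows):
        tok.head = tokens[head_index]
    return tokens


# Labels and heads copied from the parse table above.
SENTENCES = [
    [("The", "det", 1), ("cat", "nsubj", 2), ("sat", "ROOT", 2),
     ("on", "prep", 2), ("the", "det", 5), ("mat", "pobj", 3),
     (".", "punct", 2)],
    [("The", "det", 1), ("cat", "nsubj", 2), ("jumped", "ROOT", 2),
     ("and", "cc", 2), ("picked", "conj", 2), ("up", "prt", 4),
     ("the", "det", 7), ("biscuit", "dobj", 4), (".", "punct", 2)],
    [("The", "det", 1), ("cat", "nsubj", 2), ("ate", "ROOT", 2),
     ("biscuit", "dobj", 2), ("and", "cc", 3), ("cookies", "conj", 3),
     (".", "punct", 2)],
]

patterns = [sentence_pattern(build_sentence(rows)) for rows in SENTENCES]
print(" ".join(patterns))  # prints: SVO SVVO SVOO
```

With a real parse, the same function should be usable as `[sentence_pattern(sent) for sent in doc.sents]`, since iterating over a sentence yields tokens that carry `.dep_` and `.head`. Results then depend on the labels the loaded model actually assigns.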
