Spacy to Conll format without using Spacy's sentence splitter

Problem description

This post shows how to get dependencies of a block of text in Conll format with Spacy's taggers. This is the solution posted:

import spacy
nlp_en = spacy.load('en')  # note: in spaCy v3+, use spacy.load('en_core_web_sm')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))

It outputs the following block:

1   Bob bob NNP PERSON  2   nsubj
2   bought  buy VBD     0   ROOT
3   the the DT      4   det
4   pizza   pizza   NN      2   dobj
5   to  to  IN      2   dative
6   Alice   alice   NNP PERSON  5   pobj

I would like to get the same output WITHOUT using doc.sents.

Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.

How can I get POS, NER, and dependencies of one sentence in Conll format with Spacy without having to use Spacy's sentence splitter?

Solution

A Doc in spaCy is iterable, and the documentation states that it iterates over Token objects:

 |  __iter__(...)
 |      Iterate over `Token`  objects, from which the annotations can be
 |      easily accessed. This is the main way of accessing `Token` objects,
 |      which are the main way annotations are accessed from Python. If faster-
 |      than-Python speeds are required, you can instead access the annotations
 |      as a numpy array, or access the underlying C data directly from Cython.
 |      
 |      EXAMPLE:
 |          >>> for token in doc
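As a quick sketch of that iteration (assuming nlp is an already-loaded spaCy pipeline):

doc = nlp(u'Bob bought the pizza to Alice')
for token in doc:
    # each Token already carries its annotations
    print(token.text, token.tag_, token.dep_)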

Therefore I believe you would just have to make a Doc for each of your split sentences, then do something like the following:

import spacy

nlp = spacy.load('en')  # note: in spaCy v3+, use spacy.load('en_core_web_sm')

def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # The original answer used sent[0].i here, which is undefined in
            # this scope; doc[0].i is the equivalent (and is 0 when the doc
            # is a single sentence), giving the 1-based index of the head.
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))

Of course, following the CoNLL format you would have to print a newline after each sentence.
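For example, a minimal usage sketch, where my_sentence_splitter is a placeholder for your own splitter (not a real function):

text = u'Bob bought the pizza to Alice. Alice thanked him.'
# my_sentence_splitter is hypothetical: substitute your own splitter here
for sentence in my_sentence_splitter(text):
    printConll(sentence)
    print()  # blank line between sentences, per the CoNLL format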
