Spacy to Conll format without using Spacy's sentence splitter

Problem description

This post shows how to get dependencies of a block of text in Conll format with Spacy's taggers. This is the solution posted:

import spacy
nlp_en = spacy.load('en')  # note: in spaCy v3+, use spacy.load('en_core_web_sm')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))

It outputs the following block:

1   Bob bob NNP PERSON  2   nsubj
2   bought  buy VBD     0   ROOT
3   the the DT      4   det
4   pizza   pizza   NN      2   dobj
5   to  to  IN      2   dative
6   Alice   alice   NNP PERSON  5   pobj

I would like to get the same output WITHOUT using doc.sents.

Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.

How can I get POS, NER, and dependencies of one sentence in Conll format with Spacy without having to use Spacy's sentence splitter?

Solution

A Doc in spaCy is iterable, and the documentation states that it iterates over Token objects:

 |  __iter__(...)
 |      Iterate over `Token`  objects, from which the annotations can be
 |      easily accessed. This is the main way of accessing `Token` objects,
 |      which are the main way annotations are accessed from Python. If faster-
 |      than-Python speeds are required, you can instead access the annotations
 |      as a numpy array, or access the underlying C data directly from Cython.
 |      
 |      EXAMPLE:
 |          >>> for token in doc
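As a quick sketch of that iteration (assuming nlp is an already-loaded spaCy pipeline):

doc = nlp(u'Bob bought the pizza to Alice')
for token in doc:
    # each Token already carries its annotations
    print(token.text, token.tag_, token.dep_)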

Therefore I believe you would just have to make a Doc for each of your split sentences, then do something like the following:

import spacy

nlp = spacy.load('en')  # note: in spaCy v3+, use spacy.load('en_core_web_sm')

def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # The original answer used sent[0].i here, which is undefined in
            # this scope; doc[0].i is the equivalent (and is 0 when the doc
            # is a single sentence), giving the 1-based index of the head.
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))

Of course, following the CoNLL format you would have to print a newline after each sentence.
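For example, a minimal usage sketch, where my_sentence_splitter is a placeholder for your own splitter (not a real function):

text = u'Bob bought the pizza to Alice. Alice thanked him.'
# my_sentence_splitter is hypothetical: substitute your own splitter here
for sentence in my_sentence_splitter(text):
    printConll(sentence)
    print()  # blank line between sentences, per the CoNLL format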
