Spacy to Conll format without using Spacy's sentence splitter
Question
This post shows how to get dependencies of a block of text in Conll format with Spacy's taggers. This is the solution posted:
import spacy

nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # word.i is the position in the *doc*; this index is per-sentence
            word,
            word.lemma_,
            word.tag_,      # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_,      # dependency relation
        ))
It outputs the following:
1 Bob bob NNP PERSON 2 nsubj
2 bought buy VBD 0 ROOT
3 the the DT 4 det
4 pizza pizza NN 2 dobj
5 to to IN 2 dative
6 Alice alice NNP PERSON 5 pobj
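The head-index arithmetic behind that last-but-one column can be checked without spaCy, using a tiny stand-in token class. (`Token` and `conll_head_indices` below are hypothetical illustrations, not part of spaCy's API; the head assignments mirror the parse shown above.)

```python
class Token:
    """Minimal stand-in for a spaCy token: an absolute position and a head."""
    def __init__(self, i, text):
        self.i = i
        self.text = text
        self.head = self  # a root points to itself, as in spaCy

def conll_head_indices(tokens):
    """Convert absolute head positions to 1-based CoNLL head indices (0 = ROOT)."""
    result = []
    for word in tokens:
        if word.head is word:
            result.append(0)
        else:
            result.append(word.head.i - tokens[0].i + 1)
    return result

# "Bob bought the pizza to Alice": 'bought' is the root.
words = [Token(i, t) for i, t in enumerate(
    ["Bob", "bought", "the", "pizza", "to", "Alice"])]
words[0].head = words[1]   # Bob   -> bought
words[2].head = words[3]   # the   -> pizza
words[3].head = words[1]   # pizza -> bought
words[4].head = words[1]   # to    -> bought
words[5].head = words[4]   # Alice -> to

print(conll_head_indices(words))  # [2, 0, 4, 2, 2, 5]
```

These match the sixth column of the output above: head indices are 1-based within the sentence, with 0 reserved for the root.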
I would like to get the same output WITHOUT using doc.sents.
Indeed, I have my own sentence-splitter. I would like to use it, and then give Spacy one sentence at a time to get POS, NER, and dependencies.
How can I get POS, NER, and dependencies of one sentence in Conll format with Spacy without having to use Spacy's sentence splitter?
Answer
A Document in spaCy is iterable, and the documentation states that it iterates over Tokens:
| __iter__(...)
| Iterate over `Token` objects, from which the annotations can be
| easily accessed. This is the main way of accessing `Token` objects,
| which are the main way annotations are accessed from Python. If faster-
| than-Python speeds are required, you can instead access the annotations
| as a numpy array, or access the underlying C data directly from Cython.
|
| EXAMPLE:
| >>> for token in doc
Therefore I believe you would just have to make a Document for each of your split sentences, then do something like the following:
def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)  # assumes nlp = spacy.load(...) as above
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # doc, not sent: here the whole doc is a single sentence
            head_idx = word.head.i - doc[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,          # word.i also gives the position in the *doc*
            word,
            word.lemma_,
            word.tag_,      # fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_,      # dependency relation
        ))
Of course, following the CoNLL format, you would have to print a newline after each sentence.
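That blank-line separation can be handled in one place if the per-sentence formatter returns its lines instead of printing them. A minimal sketch, using plain strings so it runs without spaCy (`blocks_to_conll` is a hypothetical helper, not part of spaCy):

```python
def blocks_to_conll(sentence_blocks):
    """Join per-sentence CoNLL blocks with the blank line the format requires.

    `sentence_blocks` is a list of strings, each holding the tab-separated
    lines for one sentence (as a printConll-style formatter would produce).
    """
    return "\n\n".join(block.rstrip("\n") for block in sentence_blocks) + "\n"

two_sents = blocks_to_conll([
    "1\tBob\tbob\tNNP\tPERSON\t2\tnsubj\n2\tbought\tbuy\tVBD\t\t0\tROOT",
    "1\tHello\thello\tUH\t\t0\tROOT",
])
print(two_sents)
```

With your own splitter, you would feed each sentence string to spaCy, collect the formatted lines per sentence, and join them this way.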