spaCy documentation for orth, pos, tag, lemma and text
Question
I am new to spaCy. I am adding this post as documentation, to make things simpler for other newcomers like me.
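import spacy

# 'en' is a shortcut link from older spaCy releases; newer releases need a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    # Names without a trailing underscore are integer IDs; names with one are strings.
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)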
I would like to understand what orth, lemma, tag and pos mean. This code prints out their values. Also, what is the difference between print(word) and print(word.orth_)?
Recommended answer
What is the meaning of orth, lemma, tag and pos?
See https://spacy.io/docs/usage/pos-tagging#pos-schemes
What is the difference between print(word) and print(word.orth_)?
In short:
word.orth_ and word.text are the same. In spaCy, an attribute name that ends with an underscore (orth_, lemma_, tag_, pos_) returns the string value, while the same name without the underscore (orth, lemma, tag, pos) returns the corresponding integer ID.
In less short:
When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks the word up in the store where all the vocabulary strings are kept:
    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]
(For details, see "In long" below for an explanation of self.c.lex.orth.)
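The relationship between orth, orth_ and text can be sketched in plain Python. This is only a toy model under stated assumptions: ToyStringStore and ToyToken are hypothetical names, and the sequential integer IDs are a simplification (spaCy's real string store may assign IDs differently).

```python
class ToyStringStore:
    """Hypothetical stand-in for vocab.strings: maps strings to integer IDs and back."""

    def __init__(self):
        self._ids = {}       # string -> integer ID
        self._strings = []   # integer ID -> string

    def add(self, string):
        # Assign the next free ID the first time a string is seen.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # Look a string up by its integer ID.
        return self._strings[key]


class ToyToken:
    """Hypothetical token that stores only an integer ID, like self.c.lex.orth."""

    def __init__(self, strings, orth_id):
        self.strings = strings
        self.orth = orth_id

    @property
    def orth_(self):
        # Mirrors: return self.vocab.strings[self.c.lex.orth]
        return self.strings[self.orth]

    @property
    def text(self):
        # Mirrors: return self.orth_
        return self.orth_


strings = ToyStringStore()
token = ToyToken(strings, strings.add("KEEP"))
print(token.orth, token.orth_, token.text)  # prints: 0 KEEP KEEP
```

The point of the design is that the token itself carries only a small integer, and the string is resolved on demand through the shared store.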
And word.text returns the string representation of the word, which merely wraps the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128
    property text:
        def __get__(self):
            return self.orth_
And when you call print(word), Python calls the __str__ dunder method (__repr__ delegates to it as well), which returns word.__unicode__() on Python 3 or word.__bytes__() on Python 2. Both point back to word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55
    cdef class Token:
        """
        An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset

        def __hash__(self):
            return hash((self.doc, self.i))

        def __len__(self):
            """
            Number of unicode characters in token.text.
            """
            return self.c.lex.length

        def __unicode__(self):
            return self.text

        def __bytes__(self):
            return self.text.encode('utf8')

        def __str__(self):
            if is_config(python3=True):
                return self.__unicode__()
            return self.__bytes__()

        def __repr__(self):
            return self.__str__()
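The delegation chain above can be reproduced with an ordinary Python 3 class. This is a minimal sketch, not spaCy's code: ReprToken is a hypothetical name, and the Python 2 bytes branch is left out.

```python
class ReprToken:
    """Hypothetical token whose printed form is just its text."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # print() and str() end up here.
        return self.text

    def __repr__(self):
        # repr() and the interactive prompt end up here, then delegate to __str__.
        return self.__str__()


word = ReprToken("Rock")
print(word)                    # prints: Rock
print(repr(word))              # prints: Rock
print(str(word) == word.text)  # prints: True
```

Because both methods resolve to the same attribute, printing the object and printing its text give identical output.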
In long:
Let's try to walk through this step by step:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>
After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:
    cdef class Doc:
        """
        A sequence of `Token` objects. Access sentences and named entities,
        export annotations to numpy arrays, losslessly serialize to compressed
        binary strings.

        Aside: Internals
            The `Doc` object holds an array of `TokenC` structs.
            The Python-level `Token` and `Span` objects are views of this
            array, i.e. they don't own the data themselves.

        Code: Construction 1
            doc = nlp.tokenizer(u'Some text')

        Code: Construction 2
            doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
        """
So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token class, we see a series of Cython property definitions, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162
    property orth:
        def __get__(self):
            return self.c.lex.orth
Tracing this back, we see self.c = &self.doc.c[offset]:
    cdef class Token:
        """
        An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
        """
        def __cinit__(self, Vocab vocab, Doc doc, int offset):
            self.vocab = vocab
            self.doc = doc
            self.c = &self.doc.c[offset]
            self.i = offset
Without thorough documentation, it is not obvious what self.c means, but from the looks of it, it takes the address of one item in the c array of the Doc doc that was passed into the __cinit__ function. So most probably it is a shortcut for reaching that token's underlying data.
Looking at Doc.c:
    cdef class Doc:
        def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
            self.vocab = vocab
            size = 20
            self.mem = Pool()
            # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
            # However, we need to remember the true starting places, so that we can
            # realloc.
            data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
            cdef int i
            for i in range(size + (PADDING*2)):
                data_start[i].lex = &EMPTY_LEXEME
                data_start[i].l_edge = i
                data_start[i].r_edge = i
            self.c = data_start + PADDING
Now we see that Doc.c refers to data_start, a Cython pointer to a block of memory allocated to hold the document's TokenC structs, shifted forward by PADDING entries (please correct me if my reading of <TokenC*> is wrong).
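The padding trick in the snippet above (allocate extra slots on both sides, then point at the first real slot) can be imitated in plain Python. This is only an illustrative sketch: PaddedArray is a hypothetical class, the PADDING value of 5 is an assumption, and the string EMPTY stands in for EMPTY_LEXEME.

```python
PADDING = 5        # assumed value for illustration only
EMPTY = "<empty>"  # stands in for EMPTY_LEXEME


class PaddedArray:
    """Hypothetical array that stays in bounds for indices from -PADDING to size + PADDING - 1."""

    def __init__(self, size):
        # Allocate size + 2 * PADDING slots, all initialised to the empty marker.
        self._data = [EMPTY] * (size + PADDING * 2)
        # Plays the role of `data_start + PADDING`: logical index 0 starts here.
        self._start = PADDING

    def __getitem__(self, i):
        return self._data[self._start + i]

    def __setitem__(self, i, value):
        self._data[self._start + i] = value


arr = PaddedArray(size=3)
arr[0] = "This"
arr[1] = "is"
# Looking a little before the start still lands on a valid, empty slot.
print(arr[0], arr[-1], arr[-PADDING])  # prints: This <empty> <empty>
```

The benefit is that code which peeks a few positions to the left or right of a token does not need a bounds check at every step.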
So going back to self.c = &self.doc.c[offset], it basically takes the address of a position in that array, more specifically the "offset-th" item.
And that is what a spacy.tokens.token.Token is.
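The idea of a token as a view onto the document's array can be sketched in plain Python. This is a rough analogy under stated assumptions: ToyDoc and ToyTokenView are hypothetical names, and a list of dicts stands in for the array of TokenC structs.

```python
class ToyDoc:
    """Hypothetical document that owns the per-token records."""

    def __init__(self, words):
        # Plays the role of the TokenC array behind Doc.c.
        self.c = [{"orth": word} for word in words]

    def __getitem__(self, offset):
        return ToyTokenView(self, offset)

    def __len__(self):
        return len(self.c)


class ToyTokenView:
    """Hypothetical token that only refers to a record owned by the document."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset
        # Analogous to `self.c = &self.doc.c[offset]`: a reference, not a copy.
        self.c = doc.c[offset]

    @property
    def text(self):
        return self.c["orth"]


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
token = doc[3]
print(token.i, token.text)  # prints: 3 foo

# Changing the record through the document is visible through the view,
# because the view does not own the data.
doc.c[3]["orth"] = "baz"
print(token.text)           # prints: baz
```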
Going back to the property:
    property orth:
        def __get__(self):
            return self.c.lex.orth
We see that self.c.lex is the lex field that spacy.tokens.doc.Doc set up as data_start[i].lex, and self.c.lex.orth is simply an integer ID for the word, used as the key into the vocabulary's string store.
Thus, we see that the orth_ property looks up self.vocab.strings with the index from self.c.lex.orth; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162
    property orth_:
        def __get__(self):
            return self.vocab.strings[self.c.lex.orth]