spaCy Documentation for [orth, pos, tag, lemma and text]


Question

I am new to spaCy. I am adding this post as documentation, to make getting started simpler for newcomers like me.

import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)

I would like to understand what orth, lemma, tag and pos mean. This code prints their values; also, what is the difference between print(word) and print(word.orth_)?

Answer

What is the meaning of orth, lemma, tag and pos?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes
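For a quick feel for what the string-valued attributes contain, here is a minimal sketch (the exact lemma and tag values depend on the model spacy.load gives you):

import spacy

nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')

for word in doc:
    # orth_  : the verbatim text of the token
    # lemma_ : the base form of the word (e.g. "Rock" -> "rock")
    # pos_   : the coarse-grained part-of-speech tag (e.g. "VERB")
    # tag_   : the fine-grained part-of-speech tag (e.g. "VBP")
    print(word.orth_, word.lemma_, word.pos_, word.tag_)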

What is the difference between print(word) and print(word.orth_)?

In short:

word.orth_ and word.text are the same. A cython property whose name ends with an underscore is usually a variable that the developers did not really want to expose to the user.
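A quick way to convince yourself of that, reusing the doc from the question:

for word in doc:
    # both attributes hold the exact token string
    assert word.orth_ == word.text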

In short:

When you access the word.orth_ property at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it tries to access the index where all the vocabulary of words is kept:

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]

(For details on self.c.lex.orth, see In long below.)

And word.text returns the string representation of the word, which merely wraps around the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

property text:
    def __get__(self):
        return self.orth_

And when you print with print(word), it calls the __repr__ dunder method, which returns word.__unicode__ or word.__bytes__, which in turn points back to the word.text variable; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
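So, under Python 3 at least, printing a token goes __repr__ -> __str__ -> __unicode__ -> self.text; a small sanity check, reusing the doc from the question above:

word = doc[0]

print(word)         # prints the token text, e.g. KEEP
print(word.text)    # same output
assert str(word) == word.text == word.orth_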


In long:

Let's try to walk through this step by step:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>

After the sentence is passed into the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.
    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.
    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')
    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """

So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object, we see a number of cython properties enumerated, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth:
    def __get__(self):
        return self.c.lex.orth
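In other words, token.orth is just an integer ID, while token.orth_ is the string it stands for; a quick look, continuing with the doc above (u'This is a foo bar sentence.'):

token = doc[0]

print(token.orth)   # an integer ID (an index into the string store in older spaCy versions, a hash in newer ones)
print(token.orth_)  # the corresponding string: 'This'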

Tracing this back, we see that self.c = &self.doc.c[offset]:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

Without thorough documentation, we don't really know what self.c means, but from the looks of it, it is accessing one of the tokens within the &self.doc reference that points to the Doc doc passed into the __cinit__ function. So most probably, it is a shortcut for accessing the tokens.

Looking at Doc.c:

cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING

Now we see that Doc.c refers to a cython pointer array data_start, which allocates the memory to store the spacy.tokens.doc.Doc object (please correct me if I got the explanation of <TokenC*> wrong).

So, going back to self.c = &self.doc.c[offset], it is basically accessing the memory location where the array is stored, and more specifically the "offset-th" item in the array.

And that is what a spacy.tokens.token.Token is.
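In other words, indexing the Doc at a given offset hands back the Token view at that position, and the token remembers its offset in .i; a small check:

for offset in range(len(doc)):
    # each Token is a view into doc.c at this offset, and token.i records it
    assert doc[offset].i == offset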

Going back to the property:

property orth:
    def __get__(self):
        return self.c.lex.orth

We see that self.c.lex accesses the data_start[i].lex from spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer that indicates the index of the occurrence of the word kept in the spacy.tokens.doc.Doc internal vocabulary.

Thus, we see that the orth_ property tries to access self.vocab.strings with the index from self.c.lex.orth, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
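Putting it all together, orth_ is just a lookup of the integer orth in the shared string store, so the following round trip should hold:

token = doc[0]

# orth_ looks the integer ID up in the vocabulary's string store
assert nlp.vocab.strings[token.orth] == token.orth_ == token.text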
