How does the Gensim Fasttext pre-trained model get vectors for out-of-vocabulary words?


Problem description

I am using gensim to load a pre-trained fastText model. I downloaded the English Wikipedia-trained model from the fastText website.

Here is the code I wrote to load the pre-trained model:

from gensim.models import FastText as ft
model = ft.load_fasttext_format("wiki.en.bin")
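
(Note: load_fasttext_format was later deprecated; on gensim 3.8 and newer the equivalent loader is load_facebook_model. A minimal sketch, assuming a recent gensim:)

from gensim.models.fasttext import load_facebook_model  # gensim >= 3.8
model = load_facebook_model("wiki.en.bin")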

I try to check whether the following phrase exists in the vocabulary (which it rarely would, as these are pre-trained word models).

print("internal executive" in model.wv.vocab)
print("internal executive" in model.wv)

False
True

So the phrase "internal executive" is not present in the vocabulary, yet we still get a vector corresponding to it.

model.wv["internal executive"]
Out[46]:
array([ 0.0210917 , -0.15233646, -0.1173932 , -0.06210957, -0.07288644,
       -0.06304111,  0.07833624, -0.17026938, -0.21922196,  0.01146349,
       -0.13639058,  0.17283678, -0.09251394, -0.17875175,  0.01339212,
       -0.26683623,  0.05487974, -0.11843193, -0.01982722,  0.37037706,
       -0.24370994,  0.14269598, -0.16363597,  0.00328478, -0.16560239,
       -0.1450972 , -0.24787527, -0.01318423,  0.03277111,  0.16175713,
       -0.19367714,  0.16955379,  0.1972683 ,  0.09044111,  0.01731548,
       -0.0034324 , -0.04834719,  0.14321515,  0.01422525, -0.08803893,
       -0.29411593, -0.1033244 ,  0.06278021,  0.16452256,  0.0650492 ,
        0.1506474 , -0.14194389,  0.10778475,  0.16008648, -0.07853138,
        0.2183501 , -0.25451994, -0.0345991 , -0.28843886,  0.19964759,
       -0.10923116,  0.26665714, -0.02544454,  0.30637854,  0.04568949,
       -0.04798719, -0.05769338,  0.25762403, -0.05158515, -0.04426906,
       -0.19901046,  0.00894193, -0.17269588, -0.24747233, -0.19061406,
        0.14322804, -0.10804397,  0.4002605 ,  0.01409482, -0.04675362,
        0.10039093,  0.07260711, -0.0938239 , -0.20434211,  0.05741301,
        0.07592541, -0.02921724,  0.21137556, -0.23188967, -0.23164661,
       -0.4569614 ,  0.07434579,  0.10841205, -0.06514647,  0.01220404,
        0.02679767,  0.11840229,  0.2247431 , -0.1946325 , -0.0990666 ,
       -0.02524677,  0.0801085 ,  0.02437297,  0.00674876,  0.02088535,
        0.21464555, -0.16240154,  0.20670174, -0.21640894,  0.03900698,
        0.21772243,  0.01954809,  0.04541844,  0.18990673,  0.11806394,
       -0.21336791, -0.10871669, -0.02197789, -0.13249406, -0.20440844,
        0.1967368 ,  0.09804545,  0.1440366 , -0.08401451, -0.03715726,
        0.27826542, -0.25195453, -0.16737154,  0.3561183 , -0.15756823,
        0.06724873, -0.295487  ,  0.28395334, -0.04908851,  0.09448399,
        0.10877471, -0.05020981, -0.24595442, -0.02822314,  0.17862654,
        0.06452435, -0.15105674, -0.31911567,  0.08166212,  0.2634299 ,
        0.17043628,  0.10063848,  0.0687021 , -0.12210461,  0.10803893,
        0.13644943,  0.10755012, -0.09816817,  0.11873955, -0.03881042,
        0.18548298, -0.04769253, -0.01511982, -0.08552645, -0.05218676,
        0.05387992,  0.0497043 ,  0.06922272, -0.0089245 ,  0.24790663,
        0.27209425, -0.04925154, -0.08621719,  0.15918174,  0.25831223,
        0.01654229, -0.03617229, -0.13490392,  0.08033483,  0.34922174,
       -0.01744722, -0.16894792, -0.10506647,  0.21708378, -0.22582002,
        0.15625793, -0.10860757, -0.06058934, -0.25798836, -0.20142137,
       -0.06613475, -0.08779443, -0.10732629,  0.05967236, -0.02455976,
        0.2229451 , -0.19476262, -0.2720119 ,  0.03687386, -0.01220259,
        0.07704347, -0.1674307 ,  0.2400516 ,  0.07338555, -0.2000631 ,
        0.13897157, -0.04637206, -0.00874449, -0.32827383, -0.03435039,
        0.41587186,  0.04643605,  0.03352945, -0.13700874,  0.16430037,
       -0.13630766, -0.18546128, -0.04692861,  0.37308362, -0.30846512,
        0.5535561 , -0.11573419,  0.2332801 , -0.07236694, -0.01018955,
        0.05936847,  0.25877884, -0.2959846 , -0.13610311,  0.10905041,
       -0.18220575,  0.06902339, -0.10624941,  0.33002165, -0.12087796,
        0.06742091,  0.20762768, -0.34141317,  0.0884434 ,  0.11247049,
        0.14748637,  0.13261876, -0.07357208, -0.11968047, -0.22124515,
        0.12290633,  0.16602683,  0.01055585,  0.04445777, -0.11142147,
        0.00004863,  0.22543314, -0.14342701, -0.23209116, -0.00003538,
        0.19272381, -0.13767233,  0.04850799, -0.281997  ,  0.10343244,
        0.16510887,  0.08671653, -0.24125539,  0.01201926,  0.0995285 ,
        0.09807415, -0.06764816, -0.0206733 ,  0.04697794,  0.02000999,
        0.05817033,  0.10478792,  0.0974884 , -0.01756372, -0.2466861 ,
        0.02877498,  0.02499748, -0.00370895, -0.04728201,  0.00107118,
       -0.21848503,  0.2033032 , -0.00076264,  0.03828803, -0.2929495 ,
       -0.18218371,  0.00628893,  0.20586628,  0.2410889 ,  0.02364616,
       -0.05220835, -0.07040054, -0.03744286, -0.06718048,  0.19264086,
       -0.06490505,  0.27364203,  0.05527219, -0.27494466,  0.22256687,
        0.10330909, -0.3076979 ,  0.04852265,  0.07411488,  0.23980476,
        0.1590279 , -0.26712465,  0.07580928,  0.05644221, -0.18824042],
      dtype=float32)

Now my confusion is that fastText creates vectors for the character n-grams of a word too. So for the word "internal" it will create vectors for all its character n-grams, including the full word, and the final vector for the word is the sum of its character n-gram vectors.

However, how is it still able to give me a vector for a phrase or even a whole sentence? Isn't a fastText vector for a single word and its n-grams? So what is this vector I am seeing for the phrase, when it is clearly two words?

Answer

From the paper Enriching Word Vectors with Subword Information:

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations.

So out-of-vocabulary words are represented as the sum of their character n-gram vectors. While the intent is to handle out-of-vocab words (unks) like "blargfizzle", it also handles phrases like your input.
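
To see why a two-word string gets a vector at all: gensim treats the whole string, space included, as a single token and enumerates its character n-grams. Here is a minimal sketch of that enumeration, mirroring gensim's _compute_ngrams helper with the fastText defaults min_n=3, max_n=6:

def compute_ngrams(word, min_n=3, max_n=6):
    # fastText pads the token with boundary markers before slicing
    extended = '<' + word + '>'
    ngrams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

print(compute_ngrams("internal executive")[:6])
# ['<in', 'int', 'nte', 'ter', 'ern', 'rna']

Note that n-grams such as "al exe" span the space, so the phrase vector is not simply the sum of the two word vectors.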

If you look at the implementation of the vectors in Gensim, you can see this is indeed what it's doing (along with normalization, hashing, etc.). I added some comments starting with XXX:

def word_vec(self, word, use_norm=False):
    """
    Accept a single word as input.
    Returns the word's representations in vector space, as a 1D numpy array.
    If `use_norm` is True, returns the normalized word vector.
    """
    if word in self.vocab:
        # XXX in-vocab terms return with a simple lookup
        return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
    else:
        # from gensim.models.fasttext import compute_ngrams
        # XXX Initialize the vector for the unk
        word_vec = np.zeros(self.vectors_ngrams.shape[1], dtype=np.float32)
        ngrams = _compute_ngrams(word, self.min_n, self.max_n)
        if use_norm:
            ngram_weights = self.vectors_ngrams_norm
        else:
            ngram_weights = self.vectors_ngrams
        ngrams_found = 0
        for ngram in ngrams:
            ngram_hash = _ft_hash(ngram) % self.bucket
            if ngram_hash in self.hash2index:
                # XXX add the vector for the ngram to the unk vector
                word_vec += ngram_weights[self.hash2index[ngram_hash]]
                ngrams_found += 1
        if word_vec.any():
            return word_vec / max(1, ngrams_found)
        else:  # No ngrams of the word are present in self.ngrams
            raise KeyError('all ngrams for word %s absent from model' % word)

Note that this doesn't mean it can provide vectors for any arbitrary string - it still needs data for at least some of the n-grams in an unk, so a string like xwkxwkzrw or 天爾遠波 will probably fail to return anything if your vectors were trained on English.
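
An illustrative check of that behavior, assuming the wiki.en model loaded above (with the gensim version shown here, a miss on every n-gram bucket raises KeyError):

# OOV, but its character n-grams occurred in English training text,
# so a vector is synthesized from the matching n-gram buckets
vec = model.wv["blargfizzle"]
print(vec.shape)  # (300,)

# A string whose n-grams never appeared in the training data:
# every hashed bucket lookup misses and word_vec() raises KeyError
try:
    model.wv["天爾遠波"]
except KeyError as err:
    print(err)  # all ngrams for word 天爾遠波 absent from model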
