How does the Gensim Fasttext pre-trained model get vectors for out-of-vocabulary words?

Problem description

I am using gensim to load a pre-trained fastText model. I downloaded the English Wikipedia-trained model from the fastText website.

Here is the code I wrote to load the pre-trained model:

from gensim.models import FastText as ft
model = ft.load_fasttext_format("wiki.en.bin")
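
(Aside: gensim 4.x removed load_fasttext_format; assuming a recent gensim release, the same .bin file is loaded with load_facebook_model instead, available since gensim 3.8:)

from gensim.models.fasttext import load_facebook_model
model = load_facebook_model("wiki.en.bin")  # same Facebook-format binary file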

I try to check whether the following phrase exists in the vocabulary (which is unlikely, since this is a pre-trained model):

print("internal executive" in model.wv.vocab)
print("internal executive" in model.wv)

False
True

So the phrase "internal executive" is not present in the vocabulary, but we still get a word vector corresponding to it.

model.wv["internal executive"]
Out[46]:
array([ 0.0210917 , -0.15233646, -0.1173932 , -0.06210957, -0.07288644,
       -0.06304111,  0.07833624, -0.17026938, -0.21922196,  0.01146349,
       -0.13639058,  0.17283678, -0.09251394, -0.17875175,  0.01339212,
       -0.26683623,  0.05487974, -0.11843193, -0.01982722,  0.37037706,
       -0.24370994,  0.14269598, -0.16363597,  0.00328478, -0.16560239,
       -0.1450972 , -0.24787527, -0.01318423,  0.03277111,  0.16175713,
       -0.19367714,  0.16955379,  0.1972683 ,  0.09044111,  0.01731548,
       -0.0034324 , -0.04834719,  0.14321515,  0.01422525, -0.08803893,
       -0.29411593, -0.1033244 ,  0.06278021,  0.16452256,  0.0650492 ,
        0.1506474 , -0.14194389,  0.10778475,  0.16008648, -0.07853138,
        0.2183501 , -0.25451994, -0.0345991 , -0.28843886,  0.19964759,
       -0.10923116,  0.26665714, -0.02544454,  0.30637854,  0.04568949,
       -0.04798719, -0.05769338,  0.25762403, -0.05158515, -0.04426906,
       -0.19901046,  0.00894193, -0.17269588, -0.24747233, -0.19061406,
        0.14322804, -0.10804397,  0.4002605 ,  0.01409482, -0.04675362,
        0.10039093,  0.07260711, -0.0938239 , -0.20434211,  0.05741301,
        0.07592541, -0.02921724,  0.21137556, -0.23188967, -0.23164661,
       -0.4569614 ,  0.07434579,  0.10841205, -0.06514647,  0.01220404,
        0.02679767,  0.11840229,  0.2247431 , -0.1946325 , -0.0990666 ,
       -0.02524677,  0.0801085 ,  0.02437297,  0.00674876,  0.02088535,
        0.21464555, -0.16240154,  0.20670174, -0.21640894,  0.03900698,
        0.21772243,  0.01954809,  0.04541844,  0.18990673,  0.11806394,
       -0.21336791, -0.10871669, -0.02197789, -0.13249406, -0.20440844,
        0.1967368 ,  0.09804545,  0.1440366 , -0.08401451, -0.03715726,
        0.27826542, -0.25195453, -0.16737154,  0.3561183 , -0.15756823,
        0.06724873, -0.295487  ,  0.28395334, -0.04908851,  0.09448399,
        0.10877471, -0.05020981, -0.24595442, -0.02822314,  0.17862654,
        0.06452435, -0.15105674, -0.31911567,  0.08166212,  0.2634299 ,
        0.17043628,  0.10063848,  0.0687021 , -0.12210461,  0.10803893,
        0.13644943,  0.10755012, -0.09816817,  0.11873955, -0.03881042,
        0.18548298, -0.04769253, -0.01511982, -0.08552645, -0.05218676,
        0.05387992,  0.0497043 ,  0.06922272, -0.0089245 ,  0.24790663,
        0.27209425, -0.04925154, -0.08621719,  0.15918174,  0.25831223,
        0.01654229, -0.03617229, -0.13490392,  0.08033483,  0.34922174,
       -0.01744722, -0.16894792, -0.10506647,  0.21708378, -0.22582002,
        0.15625793, -0.10860757, -0.06058934, -0.25798836, -0.20142137,
       -0.06613475, -0.08779443, -0.10732629,  0.05967236, -0.02455976,
        0.2229451 , -0.19476262, -0.2720119 ,  0.03687386, -0.01220259,
        0.07704347, -0.1674307 ,  0.2400516 ,  0.07338555, -0.2000631 ,
        0.13897157, -0.04637206, -0.00874449, -0.32827383, -0.03435039,
        0.41587186,  0.04643605,  0.03352945, -0.13700874,  0.16430037,
       -0.13630766, -0.18546128, -0.04692861,  0.37308362, -0.30846512,
        0.5535561 , -0.11573419,  0.2332801 , -0.07236694, -0.01018955,
        0.05936847,  0.25877884, -0.2959846 , -0.13610311,  0.10905041,
       -0.18220575,  0.06902339, -0.10624941,  0.33002165, -0.12087796,
        0.06742091,  0.20762768, -0.34141317,  0.0884434 ,  0.11247049,
        0.14748637,  0.13261876, -0.07357208, -0.11968047, -0.22124515,
        0.12290633,  0.16602683,  0.01055585,  0.04445777, -0.11142147,
        0.00004863,  0.22543314, -0.14342701, -0.23209116, -0.00003538,
        0.19272381, -0.13767233,  0.04850799, -0.281997  ,  0.10343244,
        0.16510887,  0.08671653, -0.24125539,  0.01201926,  0.0995285 ,
        0.09807415, -0.06764816, -0.0206733 ,  0.04697794,  0.02000999,
        0.05817033,  0.10478792,  0.0974884 , -0.01756372, -0.2466861 ,
        0.02877498,  0.02499748, -0.00370895, -0.04728201,  0.00107118,
       -0.21848503,  0.2033032 , -0.00076264,  0.03828803, -0.2929495 ,
       -0.18218371,  0.00628893,  0.20586628,  0.2410889 ,  0.02364616,
       -0.05220835, -0.07040054, -0.03744286, -0.06718048,  0.19264086,
       -0.06490505,  0.27364203,  0.05527219, -0.27494466,  0.22256687,
        0.10330909, -0.3076979 ,  0.04852265,  0.07411488,  0.23980476,
        0.1590279 , -0.26712465,  0.07580928,  0.05644221, -0.18824042], dtype=float32)

Now my confusion is that fastText creates vectors for the character n-grams of a word too. So for the word "internal" it will create vectors for all of its character n-grams, including the full word itself, and the final word vector is the sum of its character n-gram vectors.
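
To make that decomposition concrete, here is a minimal sketch of fastText-style character n-gram extraction (fastText wraps each token in the boundary markers "<" and ">" and by default takes n-grams of length 3 to 6; char_ngrams is a hypothetical helper for illustration, not a gensim function):

def char_ngrams(token, min_n=3, max_n=6):
    # fastText pads the token with boundary markers before slicing.
    extended = "<" + token + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

print(char_ngrams("internal")[:5])  # ['<in', 'int', 'nte', 'ter', 'ern']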

However, how is it still able to give me a vector for a phrase, or even a whole sentence? Isn't a fastText vector defined for a word and its n-grams? So what are these vectors I am seeing for the phrase, when it is clearly two words?

Recommended answer

From the paper Enriching Word Vectors with Subword Information:

Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations.

So out-of-vocabulary words are represented as the sum of their character n-gram vectors. While the intent is to handle out-of-vocabulary words (unks) like "blargfizzle", it also handles phrases like your input.
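
Concretely, the phrase is treated as a single token, so its character n-grams simply span the space between the two words (a self-contained sketch, trigrams only):

token = "<internal executive>"
trigrams = [token[i:i + 3] for i in range(len(token) - 2)]
print([g for g in trigrams if " " in g])  # ['al ', 'l e', ' ex']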

If you look at the implementation of the vectors in Gensim, you can see this is indeed what it is doing (along with normalization and hashing, etc.); I added some comments starting with XXX:

def word_vec(self, word, use_norm=False):
    """
    Accept a single word as input.
    Returns the word's representations in vector space, as a 1D numpy array.
    If `use_norm` is True, returns the normalized word vector.
    """
    if word in self.vocab:
        # XXX in-vocab terms return with a simple lookup
        return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
    else:
        # from gensim.models.fasttext import compute_ngrams
        # XXX Initialize the vector for the unk
        word_vec = np.zeros(self.vectors_ngrams.shape[1], dtype=np.float32)
        ngrams = _compute_ngrams(word, self.min_n, self.max_n)
        if use_norm:
            ngram_weights = self.vectors_ngrams_norm
        else:
            ngram_weights = self.vectors_ngrams
        ngrams_found = 0
        for ngram in ngrams:
            ngram_hash = _ft_hash(ngram) % self.bucket
            if ngram_hash in self.hash2index:
                # XXX add the vector for the ngram to the unk vector
                word_vec += ngram_weights[self.hash2index[ngram_hash]]
                ngrams_found += 1
        if word_vec.any():
            return word_vec / max(1, ngrams_found)
        else:  # No ngrams of the word are present in self.ngrams
            raise KeyError('all ngrams for word %s absent from model' % word)
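
A quick way to see both branches in action (a hedged sketch, assuming the wiki.en.bin model from the question is loaded as model):

import numpy as np

v_word = model.wv.word_vec("internal")              # in-vocab: plain lookup
v_phrase = model.wv.word_vec("internal executive")  # OOV: averaged ngram vectors
print(v_word.shape, v_phrase.shape)   # (300,) (300,) for the wiki.en model
print(np.allclose(v_word, v_phrase))  # False: different ngram sets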

Note that this doesn't mean it can provide vectors for any arbitrary string. It still needs to have data for at least some of the n-grams in an unk, so a string like xwkxwkzrw or 天爾遠波 will probably fail to return anything if your vectors were trained on English.
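
Such a lookup runs into the KeyError branch of the code above (a sketch against the same loaded model; whether a particular string fails depends on which hash buckets its ngrams land in):

try:
    model.wv["天爾遠波"]
except KeyError as err:
    print(err)  # e.g. "all ngrams for word 天爾遠波 absent from model"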
