Using pre-trained word embeddings in a keras model?


Question

I was following this GitHub code from the Keras team on how to use pre-trained word embeddings. I was able to understand most of it, but I have a doubt regarding vector sizes. I was hoping someone could help me out.

First we define Tokenizer(num_words=MAX_NUM_WORDS).

According to the Keras docs for Tokenizer(), the num_words argument only keeps MAX_NUM_WORDS - 1 words, so if MAX_NUM_WORDS=20000 I'll have about 19999 words.

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
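
To see that behaviour concretely, here is a minimal sketch on a made-up toy corpus (not taken from the linked example): word_index keeps every word it has seen, while texts_to_sequences only emits indices smaller than num_words.

from keras.preprocessing.text import Tokenizer

toy_texts = ["the cat sat", "the dog sat", "a bird flew"]
toy_tokenizer = Tokenizer(num_words=3)
toy_tokenizer.fit_on_texts(toy_texts)

# word_index contains every word seen, regardless of num_words
print(toy_tokenizer.word_index)              # e.g. {'the': 1, 'sat': 2, 'cat': 3, ...}

# texts_to_sequences only emits indices 1 .. num_words - 1 (here: 1 and 2)
print(toy_tokenizer.texts_to_sequences(toy_texts))   # [[1, 2], [1, 2], []]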

Next in the code we prepare an embedding matrix based on GloVe vectors. When doing that, we build a matrix of size (20001, 100): np.zeros((MAX_NUM_WORDS + 1, 100)). I couldn't see why we use a matrix with 20001 rows if we have only 19999 words in our vocabulary.

Then we pass num_words to the Embedding layer. The Embedding layer docs say this about the input_dim argument:

input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
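
In other words, an Embedding layer with input_dim = N owns rows 0 .. N - 1, so the largest index any input sequence may contain is N - 1. A small standalone sketch of that rule (not from the example):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# input_dim=5 means the layer holds an embedding vector for indices 0..4
model = Sequential([Embedding(input_dim=5, output_dim=2, input_length=3)])
out = model.predict(np.array([[0, 1, 4]]))   # every index < input_dim, so this is fine
print(out.shape)                             # (1, 3, 2)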

embedding_layer = Embedding(input_dim=num_words,
                            output_dim=EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Here our vocabulary size will be 19999 according to the Tokenizer() function, right? So why are we passing 20001 as input_dim?

Here's a small snippet of the code taken from that GitHub link.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.initializers import Constant

# `texts` (the raw documents) and `embeddings_index` (word -> GloVe vector)
# are built earlier in the full example script
MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# prepare embedding matrix
num_words = MAX_NUM_WORDS + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Answer

I think your doubt is valid. The change was made in this commit of the example code to keep the word with index = MAX_NUM_WORDS. Before that, there was a commit on Tokenizer that made it keep num_words words instead of num_words - 1 words. But that change to Tokenizer was reverted afterwards. So I guess the author of the example update might have assumed that Tokenizer kept num_words words when the update was committed.
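
Following that reasoning: since Tokenizer actually keeps only num_words - 1 words, the indices that ever reach the model are 1 .. MAX_NUM_WORDS - 1, plus 0 for padding, so MAX_NUM_WORDS rows would already be enough. Here is a sketch of how the matrix could be sized to match that behaviour (my adaptation, not the code from the example):

# indices emitted by the tokenizer are 1 .. MAX_NUM_WORDS - 1, plus 0 for padding,
# so MAX_NUM_WORDS rows cover every index that can appear in `data`
num_words = MAX_NUM_WORDS
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:   # these indices never appear in the sequences
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,        # input_dim = max index + 1 = MAX_NUM_WORDS
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)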
