Using pre-trained word embeddings in a keras model?
Question
I was following this github code from the keras team on how to use pre-trained word embeddings. I was able to understand most of it, but I have a doubt regarding the vector sizes. I was hoping someone could help me out.
First we define Tokenizer(num_words=MAX_NUM_WORDS). According to the keras docs for Tokenizer(), the num_words argument only considers MAX_NUM_WORDS - 1 words, so if MAX_NUM_WORDS=20000 I'll have around 19999 words:
num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
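To see this concretely, here is a minimal sketch (a toy corpus and an assumed num_words=4, using tf.keras): word_index is built for every word, but texts_to_sequences only ever emits indices strictly below num_words, i.e. at most num_words - 1 distinct words.

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["the cat sat on the mat", "the dog sat on the log"]
tokenizer = Tokenizer(num_words=4)          # keep only the 3 most frequent words
tokenizer.fit_on_texts(texts)
print(tokenizer.word_index)                 # the full index is built for every word
print(tokenizer.texts_to_sequences(texts))  # but only indices 1, 2, 3 appear here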
Next in the code we prepare an embedding matrix based on the GloVe vectors. When doing that, we consider a matrix of size (20001, 100), i.e. np.zeros((MAX_NUM_WORDS + 1, 100)). I couldn't get why we consider a matrix of 20001 rows if we have only 19999 words in our vocabulary.
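For context, the embeddings_index dictionary used further down is built from the GloVe file roughly like this in the full example; the local path here is an assumption about your setup.

import os
import numpy as np

embeddings_index = {}
# assumed local path to the 100-dimensional GloVe vectors
with open(os.path.join("glove.6B", "glove.6B.100d.txt"), encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        embeddings_index[word] = np.asarray(coefs.split(), dtype="float32")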
Then we pass num_words to the Embedding layer. According to the Embedding layer docs for the input_dim argument, it says:
input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
embedding_layer = Embedding(input_dim=num_words,
                            output_dim=EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
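As a toy illustration of that rule (the sizes here are made up, using tf.keras): an Embedding layer with input_dim=N accepts indices 0..N-1, so input_dim must be the largest index you plan to feed it plus one.

import numpy as np
from tensorflow.keras.layers import Embedding

layer = Embedding(input_dim=5, output_dim=2)  # valid input indices are 0..4
out = layer(np.array([[0, 1, 4]]))            # max index 4 == input_dim - 1, so this is fine
print(out.shape)                              # (1, 3, 2)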
Here our vocabulary size will be 19999 according to the Tokenizer() function, right? So why are we passing 20001 as the input_dim?
Here's a small snippet of the code taken from that github link.
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.initializers import Constant

MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

# `texts` (list of document strings) and `embeddings_index` are built earlier in the full example
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# prepare embedding matrix
num_words = MAX_NUM_WORDS + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
Answer
I think your doubt is valid. The change was made in this commit of the code to keep the word with index = MAX_NUM_WORDS. Before that, there was a commit on Tokenizer to make it keep num_words words instead of num_words - 1 words. But that change to Tokenizer was reverted afterwards. So I guess the author of the example update might have assumed that Tokenizer kept num_words words when the update was committed.
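If it helps, here is a tiny reproduction of that point (a toy corpus and MAX_NUM_WORDS=3, assuming the current tf.keras Tokenizer behaviour): the word with index MAX_NUM_WORDS does exist in word_index, so the loop in the question fills that matrix row, but texts_to_sequences never actually emits that index, so the row is never looked up.

from tensorflow.keras.preprocessing.text import Tokenizer

MAX_NUM_WORDS = 3
texts = ["a a a a b b b c c d"]        # frequencies: a=4, b=3, c=2, d=1
tok = Tokenizer(num_words=MAX_NUM_WORDS)
tok.fit_on_texts(texts)

print(tok.word_index)                  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
print(tok.texts_to_sequences(texts))   # [[1, 1, 1, 1, 2, 2, 2]] -- index 3 never appears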