Using pre-trained word embeddings in a keras model?


Question

I was following this GitHub code from the Keras team on how to use pre-trained word embeddings. I was able to understand most of it, but I have a doubt regarding vector sizes. I was hoping someone could help me out.

First we define Tokenizer(num_words=MAX_NUM_WORDS).

According to the Keras docs for Tokenizer(), the num_words argument only keeps MAX_NUM_WORDS - 1 words, so if MAX_NUM_WORDS=20000 I'll have about 19999 words.

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
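
To see that behaviour concretely, here is a minimal sketch on a made-up toy corpus (not taken from the linked example): word_index keeps every word it has seen, while texts_to_sequences only emits indices smaller than num_words.

from keras.preprocessing.text import Tokenizer

toy_texts = ["the cat sat", "the dog sat", "a bird flew"]
toy_tokenizer = Tokenizer(num_words=3)
toy_tokenizer.fit_on_texts(toy_texts)

# word_index contains every word seen, regardless of num_words
print(toy_tokenizer.word_index)              # e.g. {'the': 1, 'sat': 2, 'cat': 3, ...}

# texts_to_sequences only emits indices 1 .. num_words - 1 (here: 1 and 2)
print(toy_tokenizer.texts_to_sequences(toy_texts))   # [[1, 2], [1, 2], []]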

Next in the code we prepare an embedding matrix based on GloVe vectors. When doing that, we build a matrix of size (20001, 100): np.zeros((MAX_NUM_WORDS + 1, 100)). I couldn't see why we use a matrix with 20001 rows if we have only 19999 words in our vocabulary.

Then we pass num_words to the Embedding layer. The Embedding layer docs say this about the input_dim argument:

input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
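
In other words, an Embedding layer with input_dim = N owns rows 0 .. N - 1, so the largest index any input sequence may contain is N - 1. A small standalone sketch of that rule (not from the example):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# input_dim=5 means the layer holds an embedding vector for indices 0..4
model = Sequential([Embedding(input_dim=5, output_dim=2, input_length=3)])
out = model.predict(np.array([[0, 1, 4]]))   # every index < input_dim, so this is fine
print(out.shape)                             # (1, 3, 2)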

embedding_layer = Embedding(input_dim=num_words,
                            output_dim=EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Here our vocabulary size will be 19999 according to the Tokenizer() function, right? So why are we passing 20001 as input_dim?

Here's a small snippet of the code taken from that GitHub link.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.initializers import Constant

# `texts` (the raw documents) and `embeddings_index` (word -> GloVe vector)
# are built earlier in the full example script
MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# prepare embedding matrix
num_words = MAX_NUM_WORDS + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Answer

I think your doubt is valid. The change was made in this commit of the example code to keep the word with index = MAX_NUM_WORDS. Before that, there was a commit on Tokenizer that made it keep num_words words instead of num_words - 1 words. But that change to Tokenizer was reverted afterwards. So I guess the author of the example update might have assumed that Tokenizer kept num_words words when the update was committed.
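
Following that reasoning: since Tokenizer actually keeps only num_words - 1 words, the indices that ever reach the model are 1 .. MAX_NUM_WORDS - 1, plus 0 for padding, so MAX_NUM_WORDS rows would already be enough. Here is a sketch of how the matrix could be sized to match that behaviour (my adaptation, not the code from the example):

# indices emitted by the tokenizer are 1 .. MAX_NUM_WORDS - 1, plus 0 for padding,
# so MAX_NUM_WORDS rows cover every index that can appear in `data`
num_words = MAX_NUM_WORDS
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:   # these indices never appear in the sequences
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(num_words,        # input_dim = max index + 1 = MAX_NUM_WORDS
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)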
