PyTorch: Loading word vectors into Field vocabulary vs. Embedding layer

Question

I'm coming from Keras to PyTorch. I would like to create a PyTorch Embedding layer (a matrix of size V x D, where V indexes the vocabulary words and D is the embedding vector dimension) with GloVe vectors, but I am confused by the steps required.

In Keras, you can load the GloVe vectors by having the Embedding layer constructor take a weights argument:

# Keras code.
embedding_layer = Embedding(..., weights=[embedding_matrix])

When looking at PyTorch and the TorchText library, I see that the embeddings should be loaded twice, once in a Field and then again in an Embedding layer. Here is sample code that I found:

# PyTorch code.

# Create a field for text and build a vocabulary with 'glove.6B.100d'
# pretrained embeddings.
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)

TEXT.build_vocab(train_data, vectors='glove.6B.100d')


# Build an RNN model with an Embedding layer.
class RNN(nn.Module):
    def __init__(self, ...):

        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        ...

# Initialize the embedding layer with the Glove embeddings from the
# vocabulary. Why are two steps needed???
model = RNN(...)
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

Specifically:

  1. Why are the GloVe embeddings loaded in a Field in addition to the Embedding?
  2. I thought the Field function build_vocab() just builds its vocabulary from the training data. How are the GloVe embeddings involved during this step?

Here are other StackOverflow questions that did not answer my questions:

PyTorch / Gensim - How to load pre-trained word embeddings

Embedding in pytorch

PyTorch LSTM - using word embeddings instead of nn.Embedding()

Thanks for your help.

Answer

When torchtext builds the vocabulary, it aligns the token indices with the embeddings. If your vocabulary doesn't have the same size and ordering as the pre-trained embeddings, the indices aren't guaranteed to match, so you could end up looking up incorrect embeddings. build_vocab() creates the vocabulary for your dataset with the corresponding embeddings and discards the rest of the embeddings, because they are unused.

The GloVe-6B embeddings include a vocabulary of 400K words. For example, the IMDB dataset only uses about 120K of these; the other 280K are unused.

import torch
from torchtext import data, datasets, vocab

TEXT = data.Field(tokenize='spacy', include_lengths=True)
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train_data, vectors='glove.6B.100d')

TEXT.vocab.vectors.size() # => torch.Size([121417, 100])

# For comparison, load the full GloVe vectors.
glove = vocab.GloVe(name="6B", dim=100)
glove.vectors.size() # => torch.Size([400000, 100])

# Embedding of the first token is not the same
torch.equal(TEXT.vocab.vectors[0], glove.vectors[0]) # => False

# Index of the word "the"
TEXT.vocab.stoi["the"] # => 2
glove.stoi["the"] # => 0
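# ("the" shifts from index 0 to index 2 because, with the default Field settings,
#  TEXT.vocab reserves indices 0 and 1 for the <unk> and <pad> specials.)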

# Same embedding when using the respective index of the same word
torch.equal(TEXT.vocab.vectors[2], glove.vectors[0]) # => True

After the vocabulary has been built with its embeddings, the input sequences will be given in their tokenised form, where each token is represented by its index. In the model you want to use the embeddings of these tokens, so you need to create the embedding layer, but with the embeddings of your vocabulary. The easiest and recommended way is nn.Embedding.from_pretrained, which is essentially the same as the Keras version.

embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors)

# Or if you want to make it trainable
trainable_embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)
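
Applied to the model from the question, a minimal sketch might look as follows. This is only an illustration: the hidden_dim and output_dim arguments, the LSTM layer, and the classifier head are placeholders rather than part of the original code, and details such as handling the (text, lengths) tuple produced by include_lengths=True are omitted.

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, pretrained_vectors, hidden_dim, output_dim):
        super().__init__()
        # Build the embedding layer directly from the vocabulary's vectors;
        # vocab_size and embedding_dim are inferred from the tensor shape,
        # so the separate weight.data.copy_() step is no longer needed.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.rnn = nn.LSTM(pretrained_vectors.size(1), hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)      # (seq_len, batch, embedding_dim)
        _, (hidden, _) = self.rnn(embedded)
        return self.fc(hidden[-1])

model = RNN(TEXT.vocab.vectors, hidden_dim=256, output_dim=2)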

You didn't mention how embedding_matrix is created in the Keras version, nor how the vocabulary is built so that it can be used with embedding_matrix. If you do that by hand (or with any other utility), you don't need torchtext at all, and you can initialise the embeddings just like in Keras. torchtext is purely a convenience for common data-related tasks.
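
For reference, a hand-rolled version without torchtext might look like the sketch below. The glove.6B.100d.txt path, the toy training texts, and the whitespace tokeniser are assumptions made purely for illustration.

import numpy as np
import torch
import torch.nn as nn

EMBEDDING_DIM = 100
glove_path = "glove.6B.100d.txt"  # assumed local path to the raw GloVe file

# 1. Build a vocabulary from your own training texts (toy example).
train_texts = ["the movie was great", "the plot was terrible"]
vocab = {"<unk>": 0, "<pad>": 1}
for text in train_texts:
    for token in text.split():
        vocab.setdefault(token, len(vocab))

# 2. Fill an embedding matrix: rows for words found in GloVe get their
#    pretrained vectors, everything else keeps a zero vector.
embedding_matrix = np.zeros((len(vocab), EMBEDDING_DIM), dtype="float32")
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        if word in vocab:
            embedding_matrix[vocab[word]] = np.asarray(values, dtype="float32")

# 3. Initialise the layer, much like passing weights=[embedding_matrix] in Keras.
embedding_layer = nn.Embedding.from_pretrained(
    torch.from_numpy(embedding_matrix), freeze=False
)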
