PyTorch: Loading word vectors into Field vocabulary vs. Embedding layer


Question

I'm coming from Keras to PyTorch. I would like to create a PyTorch Embedding layer (a matrix of size V x D, where V ranges over the vocabulary word indices and D is the embedding vector dimension) with GloVe vectors, but am confused by the steps needed.

In Keras, you can load the GloVe vectors by having the Embedding layer constructor take a weights argument:

# Keras code.
embedding_layer = Embedding(..., weights=[embedding_matrix])

When looking at PyTorch and the TorchText library, I see that the embeddings should be loaded twice, once in a Field and then again in an Embedding layer. Here is sample code that I found:

# PyTorch code.

# Create a field for text and build a vocabulary with 'glove.6B.100d'
# pretrained embeddings.
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)

TEXT.build_vocab(train_data, vectors='glove.6B.100d')


# Build an RNN model with an Embedding layer.
class RNN(nn.Module):
    def __init__(self, ...):

        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        ...

# Initialize the embedding layer with the Glove embeddings from the
# vocabulary. Why are two steps needed???
model = RNN(...)
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

Specifically:

  1. Why are the GloVe embeddings loaded in a Field in addition to the Embedding?
  2. I thought the Field function build_vocab() just builds its vocabulary from the training data. How are the GloVe embeddings involved here during this step?

Here are other StackOverflow questions that did not answer my questions:

PyTorch / Gensim - How to load pre-trained word embeddings

Embedding in pytorch

PyTorch LSTM - using word embeddings instead of nn.Embedding()

Thanks for any help.

Answer

When torchtext builds the vocabulary, it aligns the token indices with the embedding. If your vocabulary doesn't have the same size and ordering as the pre-trained embeddings, the indices aren't guaranteed to match, so you might look up incorrect embeddings. build_vocab() creates the vocabulary for your dataset with the corresponding embeddings and discards the rest of the embeddings, because those are unused.

The GloVe 6B embeddings include a vocabulary of 400K words. For example, the IMDB dataset only uses about 120K of these; the other 280K are unused.

import torch
from torchtext import data, datasets, vocab

TEXT = data.Field(tokenize='spacy', include_lengths=True)
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train_data, vectors='glove.6B.100d')

TEXT.vocab.vectors.size() # => torch.Size([121417, 100])

# For comparison the full GloVe
glove = vocab.GloVe(name="6B", dim=100)
glove.vectors.size() # => torch.Size([400000, 100])

# Embedding of the first token is not the same
torch.equal(TEXT.vocab.vectors[0], glove.vectors[0]) # => False

# Index of the word "the"
TEXT.vocab.stoi["the"] # => 2
glove.stoi["the"] # => 0

# Same embedding when using the respective index of the same word
torch.equal(TEXT.vocab.vectors[2], glove.vectors[0]) # => True

After the vocabulary has been built with its embeddings, the input sequences are given in tokenised form, where each token is represented by its index. In the model you want to use the embeddings of these tokens, so you need to create the embedding layer with the embeddings of your vocabulary. The easiest and recommended way is nn.Embedding.from_pretrained, which is essentially the same as the Keras version.

embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors)

# Or if you want to make it trainable
trainable_embedding_layer = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=False)
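As a follow-up, here is how that layer could slot into the RNN skeleton from the question, so that no separate weight-copying step is needed. This is only a sketch under assumed names: hidden_dim, output_dim, and the LSTM/linear layers are illustrative, not part of the original code.

# Sketch: build the embedding layer inside the model directly from the
# vocabulary vectors (the names below are illustrative assumptions).
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_vectors, hidden_dim, output_dim):
        super().__init__()
        # vocab_vectors is TEXT.vocab.vectors, shape (vocab_size, embedding_dim).
        self.embedding = nn.Embedding.from_pretrained(vocab_vectors, freeze=False)
        self.rnn = nn.LSTM(vocab_vectors.size(1), hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text: (seq_len, batch) tensor of token indices from TEXT.vocab.
        embedded = self.embedding(text)             # (seq_len, batch, embedding_dim)
        output, (hidden, cell) = self.rnn(embedded)
        return self.fc(hidden[-1])                  # (batch, output_dim)

model = RNN(TEXT.vocab.vectors, hidden_dim=256, output_dim=1)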

You didn't mention how the embedding_matrix is created in the Keras version, nor how the vocabulary is built such that it can be used with the embedding_matrix. If you do that by hand (or with any other utility), you don't need torchtext at all, and you can initialise the embeddings just like in Keras; torchtext is purely a convenience for common data-related tasks.
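
If you do go the manual route, a minimal sketch of that workflow might look like the following. The GloVe file name, the toy training sentences, and the <unk>/<pad> handling are all placeholder assumptions for illustration.

# Sketch of the manual (torchtext-free) route, mirroring the Keras workflow.
import numpy as np
import torch
import torch.nn as nn

# Assumed: a plain-text GloVe file with "word v1 v2 ... vD" per line.
def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.100d.txt")
embedding_dim = 100

# Build the vocabulary from the training data by hand.
train_sentences = ["the movie was great", "the movie was terrible"]  # placeholder data
vocab = {"<unk>": 0, "<pad>": 1}
for sentence in train_sentences:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

# Fill the embedding matrix row by row; words missing from GloVe keep a random init.
embedding_matrix = np.random.normal(scale=0.1, size=(len(vocab), embedding_dim)).astype(np.float32)
for word, idx in vocab.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

# Equivalent of Keras' Embedding(..., weights=[embedding_matrix]).
embedding_layer = nn.Embedding.from_pretrained(torch.from_numpy(embedding_matrix), freeze=False)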
