Is it possible to freeze only certain embedding weights in the embedding layer in pytorch?


Problem description

When using GloVe embedding in NLP tasks, some words from the dataset might not exist in GloVe. Therefore, we instantiate random weights for these unknown words.

Would it be possible to freeze weights gotten from GloVe, and train only the newly instantiated weights?

I am only aware that we can set: model.embedding.weight.requires_grad = False

But this makes the new words untrainable..
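
For illustration, here is a minimal sketch of what that line does (the sizes below are made up): requires_grad is a flag on the whole weight tensor, so it cannot freeze some rows while leaving others trainable.

import torch

# Hypothetical sizes: a vocabulary of 500 tokens with 300-dimensional vectors
embedding = torch.nn.Embedding(500, 300)

# requires_grad is a per-tensor flag, so this freezes ALL 500 rows at once,
# including the randomly initialized rows for words missing from GloVe.
embedding.weight.requires_grad = False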

Or are there better ways to extract semantics of words..

Recommended answer

1. Divide the embedding into two separate objects

One approach would be to use two separate embeddings: one for the pretrained vectors and one for the vectors still to be trained.

The GloVe one should be frozen, while the one for which there is no pretrained representation would be taken from the trainable layer.

This can be done if you format your data so that the pretrained token representations occupy a smaller index range than the tokens without a GloVe representation. Let's say your pretrained indices are in the range [0, 300], while those without representation are [301, 500]. I would go with something along those lines:

import numpy as np
import torch


class YourNetwork(torch.nn.Module):
    def __init__(self, glove_embeddings: np.ndarray, how_many_tokens_not_present: int):
        super().__init__()
        self.pretrained_embedding = torch.nn.Embedding.from_pretrained(
            torch.from_numpy(glove_embeddings)
        )
        self.trainable_embedding = torch.nn.Embedding(
            how_many_tokens_not_present, glove_embeddings.shape[1]
        )
        # Rest of your network setup

    def forward(self, batch):
        # Tokens in the batch without a pretrained representation must have indices
        # GREATER THAN OR EQUAL TO the number of pretrained ones; adjust your
        # data-creating function accordingly
        mask = batch >= self.pretrained_embedding.num_embeddings

        # You may want to optimize it, you could probably get away without the
        # copy, though I'm not currently sure how
        pretrained_batch = batch.clone()
        pretrained_batch[mask] = 0

        embedded_batch = self.pretrained_embedding(pretrained_batch)

        # Every token without representation has to be brought into appropriate range
        batch -= self.pretrained_embedding.num_embeddings
        # Zero out the ones which already have pretrained embedding
        batch[~mask] = 0
        non_pretrained_embedded_batch = self.trainable_embedding(batch)

        # And finally change appropriate tokens from placeholder embedding created by
        # pretrained into trainable embeddings.
        embedded_batch[mask] = non_pretrained_embedded_batch[mask]

        # Rest of your code
        ...
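
A minimal usage sketch follows; the token lists and the random GloVe matrix are hypothetical stand-ins for real data and are only meant to show how the index layout and the frozen/trainable split fit together. Note that Embedding.from_pretrained freezes its weights by default, so an optimizer built over the trainable parameters never touches the GloVe rows.

import numpy as np
import torch

glove_vocab = ["the", "cat", "sat"]        # hypothetical tokens present in GloVe
extra_tokens = ["blorpy", "zindle"]        # hypothetical tokens missing from GloVe

# Pretrained tokens get the lowest indices, unknown tokens follow after them,
# matching the [0, num_pretrained) / [num_pretrained, ...) split used above.
token_to_index = {tok: i for i, tok in enumerate(glove_vocab + extra_tokens)}

# Random stand-in for the real GloVe matrix (len(glove_vocab) rows, dim 50)
glove_matrix = np.random.randn(len(glove_vocab), 50).astype(np.float32)

network = YourNetwork(glove_matrix, how_many_tokens_not_present=len(extra_tokens))

# Only parameters with requires_grad=True are handed to the optimizer, so the
# frozen GloVe embedding is never updated.
optimizer = torch.optim.Adam([p for p in network.parameters() if p.requires_grad])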

2. Zero out gradients for the specified token indices

This one is a bit tricky, but I think it's pretty concise and easy to implement. So, if you obtain the indices of tokens which got no GloVe representation, you can explicitly zero their gradient after backprop, so those rows will not get updated.

import torch

embedding = torch.nn.Embedding(10, 3)
X = torch.LongTensor([[1, 2, 4, 5], [4, 3, 2, 9]])

values = embedding(X)
loss = values.mean()

# Use whatever loss you want
loss.backward()

# Let's say those indices in your embedding are pretrained (have GloVe representation)
indices = torch.LongTensor([2, 4, 5])

print("Before zeroing out gradient")
print(embedding.weight.grad)

print("After zeroing out gradient")
embedding.weight.grad[indices] = 0
print(embedding.weight.grad)

And the output of the second approach:

Before zeroing out gradient
tensor([[0.0000, 0.0000, 0.0000],
        [0.0417, 0.0417, 0.0417],
        [0.0833, 0.0833, 0.0833],
        [0.0417, 0.0417, 0.0417],
        [0.0833, 0.0833, 0.0833],
        [0.0417, 0.0417, 0.0417],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0417, 0.0417, 0.0417]])
After zeroing out gradient
tensor([[0.0000, 0.0000, 0.0000],
        [0.0417, 0.0417, 0.0417],
        [0.0000, 0.0000, 0.0000],
        [0.0417, 0.0417, 0.0417],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0417, 0.0417, 0.0417]])
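
If you would rather not repeat the zeroing after every backward pass, the same idea can be wrapped in a gradient hook. Below is a minimal sketch of that variant (the index tensor plays the same role as indices above); it relies on Tensor.register_hook, which lets the hook return a modified gradient that replaces the original one.

import torch

embedding = torch.nn.Embedding(10, 3)
# Rows assumed to hold pretrained GloVe vectors, as in the example above
pretrained_indices = torch.LongTensor([2, 4, 5])

def zero_pretrained_grad(grad):
    # Called on every backward pass; the returned tensor replaces the gradient,
    # so the pretrained rows never accumulate updates.
    grad = grad.clone()
    grad[pretrained_indices] = 0
    return grad

embedding.weight.register_hook(zero_pretrained_grad)

# From here on, every loss.backward() already arrives with those rows zeroed,
# with no manual bookkeeping needed after each backward call.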
