Difference between src_mask and src_key_padding_mask


Question

I am having a difficult time understanding transformers. Everything is getting clear bit by bit, but one thing that makes me scratch my head is: what is the difference between src_mask and src_key_padding_mask, which are passed as arguments to the forward function in both the encoder layer and the decoder layer?

https://pytorch.org/docs/master/_modules/torch/nn/modules/transformer.html#Transformer

Answer

Difference between src_mask and src_key_padding_mask

The general thing to notice is the difference between the use of the _mask tensors vs the _key_padding_mask tensors. Inside the transformer, when attention is computed, we usually get a square intermediate tensor containing all the pairwise comparisons, of size [Tx, Tx] (for the input to the encoder), [Ty, Ty] (for the shifted output - one of the inputs to the decoder), and [Ty, Tx] (for the memory mask - the attention between the output of the encoder/memory and the input to the decoder/shifted output).
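As a minimal sketch (hypothetical tensors and sizes, single head, no batching) of where those three score shapes come from in scaled dot-product attention, and where the additive _mask tensors are applied:

    import torch

    Tx, Ty, D = 5, 4, 8                         # source length, target length, feature size

    q_enc = torch.randn(Tx, D)                  # encoder self-attention queries
    k_enc = torch.randn(Tx, D)                  # encoder self-attention keys
    scores_src = q_enc @ k_enc.T / D ** 0.5     # [Tx, Tx]  <- src_mask is added to this

    q_dec = torch.randn(Ty, D)                  # decoder self-attention queries
    k_dec = torch.randn(Ty, D)
    scores_tgt = q_dec @ k_dec.T / D ** 0.5     # [Ty, Ty]  <- tgt_mask is added to this

    scores_mem = q_dec @ k_enc.T / D ** 0.5     # [Ty, Tx]  <- memory_mask is added to this

    # an additive mask: -inf entries get zero weight after softmax, i.e. those positions are ignored
    causal = torch.triu(torch.full((Ty, Ty), float('-inf')), diagonal=1)
    attn_tgt = torch.softmax(scores_tgt + causal, dim=-1)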

So these are the uses of each of the masks in the transformer (note the notation from the pytorch docs is as follows: Tx=S is the source sequence length (e.g. the max over the input batch), Ty=T is the target sequence length (e.g. the max over the target batch), B=N is the batch size, and D=E is the feature number). A combined usage sketch follows the list.

  1. src_mask [Tx, Tx] = [S, S] – the additive mask for the src sequence (optional). This is applied when doing atten_src + src_mask. I'm not sure of an example input - see tgt_mask for an example - but the typical use is to add -inf, so one could mask the src attention that way if desired. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.

  2. tgt_mask [Ty, Ty] = [T, T] – the additive mask for the tgt sequence (optional). This is applied when doing atten_tgt + tgt_mask. An example use is the triangular mask that prevents the decoder from cheating by attending to future positions. The tgt is right shifted: the first token is the start-of-sequence token embedding (SOS/BOS), so the first row of the mask keeps only the first position (zero) while the remaining positions are -inf. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.

  3. memory_mask [Ty, Tx] = [T, S] – the additive mask for the encoder output (optional). This is applied when doing atten_memory + memory_mask. Not sure of an example use, but as before, adding -inf sets some of the attention weights to zero. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.

  4. src_key_padding_mask [B, Tx] = [N, S] – the ByteTensor mask for src keys per batch (optional). Since your src usually contains sequences of different lengths, it is common to mask out the padding vectors you appended at the end. For this you mark, for each sequence in your batch, which positions are padding. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.

  5. tgt_key_padding_mask [B, Ty] = [N, T] – the ByteTensor mask for tgt keys per batch (optional). Same as the previous one. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.

  6. memory_key_padding_mask [B, Tx] = [N, S] – the ByteTensor mask for memory keys per batch (optional). Same as the previous one. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
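As referenced above, here is a minimal sketch (hypothetical sizes; random tensors only to illustrate shapes) of passing these masks to nn.Transformer's forward:

    import torch
    import torch.nn as nn

    S, T, N, E = 10, 7, 2, 32                          # Tx, Ty, batch size, feature size
    model = nn.Transformer(d_model=E, nhead=4)         # default layout is [seq, batch, feature]

    src = torch.randn(S, N, E)                         # encoder input
    tgt = torch.randn(T, N, E)                         # right-shifted decoder input

    tgt_mask = model.generate_square_subsequent_mask(T)            # [T, T] triangular/causal mask
    src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)     # True marks a padding position
    tgt_key_padding_mask = torch.zeros(N, T, dtype=torch.bool)

    out = model(src, tgt,
                tgt_mask=tgt_mask,
                src_key_padding_mask=src_key_padding_mask,
                tgt_key_padding_mask=tgt_key_padding_mask,
                memory_key_padding_mask=src_key_padding_mask)      # memory padding = src padding
    print(out.shape)                                               # torch.Size([7, 2, 32])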

Appendix

Examples from the pytorch tutorial (https://pytorch.org/tutorials/beginner/translation_transformer.html):

1 src_mask example

    src_mask = torch.zeros((src_seq_len, src_seq_len), device=DEVICE).type(torch.bool)  # all False: mask nothing

returns a tensor of booleans of size [Tx, Tx]:

tensor([[False, False, False,  ..., False, False, False],
         ...,
        [False, False, False,  ..., False, False, False]])
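Since every entry is False, nothing is masked here. As a hedged side note on the ByteTensor/BoolTensor/FloatTensor conventions repeated above, a small sketch (made-up 2x2 mask) of how a boolean mask corresponds to the equivalent additive float mask (True positions become -inf):

    import torch

    bool_mask = torch.tensor([[False, True],
                              [False, False]])
    float_mask = torch.zeros_like(bool_mask, dtype=torch.float).masked_fill(bool_mask, float('-inf'))
    print(float_mask)   # tensor([[0., -inf], [0., 0.]])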

2 tgt_mask example

    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1)  # True on the upper triangle (incl. diagonal)
    mask = mask.transpose(0, 1).float()                            # lower-triangular 0/1 floats
    mask = mask.masked_fill(mask == 0, float('-inf'))              # disallowed (future) positions -> -inf
    mask = mask.masked_fill(mask == 1, float(0.0))                 # allowed positions -> 0

generates the triangular (causal) mask for the right-shifted output, which is the input to the decoder.

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
         -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
         -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
         -inf, -inf, -inf],
         ...,
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])

Usually the right-shifted output has the BOS/SOS at the beginning, and the tutorial gets the right shift simply by appending that BOS/SOS at the front and then trimming the last element with tgt_input = tgt[:-1, :].
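For concreteness, a small sketch of that shift (assuming tgt is a [Ty+1, N] tensor of token ids whose first row is already the BOS index, as in the tutorial):

    tgt_input = tgt[:-1, :]   # decoder input: BOS, y1, ..., y_{Ty-1}
    tgt_out   = tgt[1:, :]    # training target: y1, ..., y_Ty (no BOS)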

The padding masks just mask out the padding at the end. The src padding is usually the same as the memory padding. The tgt has its own sequences and thus its own padding. Example:

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)   # [N, S]: True where the src token is padding
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)   # [N, T]: True where the tgt token is padding
    memory_padding_mask = src_padding_mask                # memory comes from src, so reuse its mask

Output:

tensor([[False, False, False,  ...,  True,  True,  True],
        ...,
        [False, False, False,  ...,  True,  True,  True]])

Note that a False means there is no padding token at that position (so yes, use that value in the transformer forward pass) and a True means there is a padding token (so mask it out so the transformer forward pass is not affected by it).
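If you only have the true length of each sequence rather than a padded tensor of token ids, a hedged sketch (hypothetical lengths) of building the same kind of boolean key padding mask:

    import torch

    lengths = torch.tensor([5, 3, 7])        # true length of each sequence in the batch
    S = int(lengths.max())                   # padded source length Tx
    # position index >= length  ->  True  ->  treated as padding and masked out
    src_key_padding_mask = torch.arange(S)[None, :] >= lengths[:, None]   # [N, S] bool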

The answers are sort of spread around, but I found only these 3 references useful (the docs for the separate layers weren't very useful, honestly):

