Difference between src_mask and src_key_padding_mask
Question
I am having a difficult time understanding transformers. Everything is becoming clear bit by bit, but one thing that makes me scratch my head is: what is the difference between src_mask and src_key_padding_mask, which are passed as arguments to the forward function in both the encoder layer and the decoder layer?
https://pytorch.org/docs/master/_modules/torch/nn/modules/transformer.html#Transformer
Answer
Difference between src_mask and src_key_padding_mask

The general thing to notice is the difference between the use of the _mask tensors vs the _key_padding_mask tensors. Inside the transformer, when attention is computed we usually get a squared intermediate tensor with all the comparisons, of size [Tx, Tx] (for the input to the encoder), [Ty, Ty] (for the shifted output, one of the inputs to the decoder), and [Ty, Tx] (for the memory mask - the attention between the output of the encoder/memory and the input to the decoder/shifted output).
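These shapes can be checked with a quick sketch (a toy single-head score computation with made-up dimensions, not the real multi-head implementation):

```python
import torch

# Toy shapes, assuming single-head attention for illustration.
Tx, Ty, E = 5, 3, 8
src = torch.randn(Tx, E)   # encoder input
tgt = torch.randn(Ty, E)   # shifted output / decoder input

enc_scores = src @ src.T   # encoder self-attention scores: [Tx, Tx]
dec_scores = tgt @ tgt.T   # decoder self-attention scores: [Ty, Ty]
mem_scores = tgt @ src.T   # decoder-encoder attention scores: [Ty, Tx]
assert enc_scores.shape == (Tx, Tx)
assert dec_scores.shape == (Ty, Ty)
assert mem_scores.shape == (Ty, Tx)
```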
So these are the uses for each of the masks in the transformer (note the notation in the pytorch docs is as follows, where Tx=S is the source sequence length (e.g. the max length over the input batch), Ty=T is the target sequence length (e.g. the max length over the target batch), B=N is the batch size, and D=E is the number of features):
src_mask [Tx, Tx] = [S, S]
– the additive mask for the src sequence (optional). This is applied when doing atten_src + src_mask. I'm not sure of an example input - see tgt_mask for an example - but the typical use is to add -inf, so one could mask out parts of src_attention that way if desired. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
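A minimal sketch of how an additive float mask works (toy scores; the choice of masking out the last source position is just an illustration):

```python
import torch
import torch.nn.functional as F

S = 4
atten_src = torch.randn(S, S)          # raw attention scores
src_mask = torch.zeros(S, S)           # additive float mask
src_mask[:, -1] = float('-inf')        # forbid attending to the last position

weights = F.softmax(atten_src + src_mask, dim=-1)
assert torch.all(weights[:, -1] == 0)  # masked column gets zero attention weight
```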
tgt_mask [Ty, Ty] = [T, T]
– the additive mask for the tgt sequence (optional). This is applied when doing atten_tgt + tgt_mask. An example use is the triangular (causal) mask that prevents the decoder from cheating. The tgt is right-shifted, so the first token is the start-of-sequence token embedding (SOS/BOS); thus in the first row only the first entry is zero (unmasked) while the remaining entries are -inf. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
memory_mask [Ty, Tx] = [T, S]
– the additive mask for the encoder output (optional). This is applied when doing atten_memory + memory_mask. I'm not sure of an example use, but as before, adding -inf sets some of the attention weights to zero. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
src_key_padding_mask [B, Tx] = [N, S]
– the ByteTensor mask for src keys per batch (optional). Since your src usually contains sequences of different lengths, it's common to mask out the padding vectors you appended at the end. For this you specify the length of each sequence per example in your batch. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
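A common way to build such a mask from per-example sequence lengths (hypothetical lengths; True marks a PAD position):

```python
import torch

N, S = 3, 5
lengths = torch.tensor([5, 3, 2])  # true (unpadded) length of each sequence

# True = padding position that must not be attended to
src_key_padding_mask = torch.arange(S)[None, :] >= lengths[:, None]  # [N, S]
assert src_key_padding_mask.shape == (N, S)
assert not src_key_padding_mask[0].any()  # length-5 sequence: no padding
assert src_key_padding_mask[2].tolist() == [False, False, True, True, True]
```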
tgt_key_padding_mask [B, Ty] = [N, T]
– the ByteTensor mask for tgt keys per batch (optional). Same as the previous. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
memory_key_padding_mask [B, Tx] = [N, S]
– the ByteTensor mask for memory keys per batch (optional). Same as the previous. See the concrete example in the appendix. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight.
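Putting the shapes together, here is a minimal end-to-end call (toy dimensions; a causal tgt_mask plus all-False padding masks, i.e. no actual padding in this sketch):

```python
import torch
import torch.nn as nn

S, T, N, E = 10, 7, 2, 16
model = nn.Transformer(d_model=E, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32)

src = torch.randn(S, N, E)  # [S, N, E]
tgt = torch.randn(T, N, E)  # [T, N, E]

tgt_mask = model.generate_square_subsequent_mask(T)         # [T, T] causal mask
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)  # [N, S], no padding here
tgt_key_padding_mask = torch.zeros(N, T, dtype=torch.bool)  # [N, T]

out = model(src, tgt,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask)
assert out.shape == (T, N, E)  # output has the target sequence length
```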
Appendix
Examples from the pytorch tutorial (https://pytorch.org/tutorials/beginner/translation_transformer.html):
src_mask = torch.zeros((src_seq_len, src_seq_len), device=DEVICE).type(torch.bool)
returns a tensor of booleans of size [Tx, Tx]:
tensor([[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False]])
tgt_mask example
mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1)  # upper-triangular booleans
mask = mask.transpose(0, 1).float()                            # lower triangle = allowed positions
mask = mask.masked_fill(mask == 0, float('-inf'))              # disallowed positions get -inf
mask = mask.masked_fill(mask == 1, float(0.0))                 # allowed positions get 0.0
generates the triangular (causal) mask for the right-shifted output, which is the input to the decoder:
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
-inf, -inf, -inf],
[0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
-inf, -inf, -inf],
[0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf,
-inf, -inf, -inf],
...,
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., -inf],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0.]])
Usually the right-shifted output has the BOS/SOS at the beginning, and the tutorial gets the right shift simply by appending that BOS/SOS at the front and then trimming the last element with tgt_input = tgt[:-1, :].
The padding masks just mask out the padding at the end. The src padding is usually the same as the memory padding. The tgt has its own sequences and thus its own padding. Example:
src_padding_mask = (src == PAD_IDX).transpose(0, 1)
tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
memory_padding_mask = src_padding_mask
Output:
tensor([[False, False, False, ..., True, True, True],
...,
[False, False, False, ..., True, True, True]])
Note that a False means there is no padding token there (so yes, use that value in the transformer forward pass) and a True means there is a padding token (so mask it out so the transformer forward pass is not affected).
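This True/False convention can be checked directly with masked_fill (toy scores; one batch element whose last two source positions are padding):

```python
import torch
import torch.nn.functional as F

scores = torch.randn(1, 3, 4)  # [N, Ty, Tx] raw attention scores
# True = padding key; here the last two source positions are PAD
key_padding_mask = torch.tensor([[False, False, True, True]])  # [N, Tx]

masked = scores.masked_fill(key_padding_mask[:, None, :], float('-inf'))
weights = F.softmax(masked, dim=-1)
assert torch.all(weights[..., 2:] == 0)  # padded keys receive zero attention
```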
The answers are sort of spread around, but I found only these 3 references useful (the separate layer docs weren't very useful, honestly):
- Long tutorial: https://pytorch.org/tutorials/beginner/translation_transformer.html
- MHA docs: https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention
- Transformer docs: https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html