Why is the embedding vector multiplied by a constant in the Transformer model?

Question

I am learning to apply the Transformer model proposed in Attention Is All You Need, following the official TensorFlow tutorial Transformer model for language understanding.

The positional encoding section says:

Since this model doesn't contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.

The positional encoding vector is added to the embedding vector.

My understanding is that the positional encoding vector is added directly to the embedding vector. But when I looked at the code, I found that the embedding vector is multiplied by a constant first.

The code in the Encoder section is as follows:

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
               rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)


    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # scale embeddings by sqrt(d_model)
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)
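
The positional_encoding helper used above is not shown in the snippet. For reference, here is a minimal sketch of the standard sinusoidal encoding from the paper, with the same name and output shape the Encoder assumes; treat it as an illustrative reconstruction rather than the tutorial's exact code:

import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
  # angle for token position pos and dimension i: pos / 10000^(2*(i//2)/d_model)
  angle_rads = (np.arange(position)[:, np.newaxis] /
                np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)))
  # sine on even dimensions, cosine on odd dimensions
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
  return tf.cast(angle_rads[np.newaxis, ...], dtype=tf.float32)  # (1, position, d_model)

Note that every entry is a sine or cosine, so the encoding values are bounded to [-1, 1].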

We can see x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32)) before x += self.pos_encoding[:, :seq_len, :].

So why is the embedding vector multiplied by a constant before the positional encoding is added in the Transformer model?

Answer

Looking around, I found this argument:

The reason we increase the embedding values before the addition is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together.
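
A quick way to see this is to compare magnitudes directly. The sinusoidal positional encoding is bounded to [-1, 1], while a freshly initialized tf.keras.layers.Embedding layer draws its weights from a small uniform range (roughly ±0.05 by default), so without the scaling the positional signal would dominate the embedding. The following sketch, using the positional_encoding reconstruction above and arbitrary toy sizes, just prints the average magnitudes involved:

import tensorflow as tf

d_model = 512
embedding = tf.keras.layers.Embedding(input_dim=8500, output_dim=d_model)
pos_encoding = positional_encoding(50, d_model)        # bounded to [-1, 1]

tokens = tf.constant([[1, 2, 3, 4, 5]])                # toy batch of token ids
emb = embedding(tokens)                                # (1, 5, d_model), tiny values at init
scaled = emb * tf.math.sqrt(tf.cast(d_model, tf.float32))

print('mean |embedding|          :', tf.reduce_mean(tf.abs(emb)).numpy())
print('mean |scaled embedding|   :', tf.reduce_mean(tf.abs(scaled)).numpy())
print('mean |positional encoding|:', tf.reduce_mean(tf.abs(pos_encoding)).numpy())

With the default uniform(-0.05, 0.05) embedding initializer the first number comes out around 0.025, so multiplying by sqrt(512) ≈ 22.6 lifts the embeddings to roughly the same order of magnitude as the positional encoding rather than leaving them about 25 times smaller.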
