LSTM with Attention

Question

I am trying to add an attention mechanism to this stacked LSTM implementation: https://github.com/salesforce/awd-lstm-lm

All examples online use an encoder-decoder architecture, which I do not want to use (do I have to use it for the attention mechanism?).

Basically, I have used https://webcache.googleusercontent.com/search?q=cache:81Q7u36DRPIJ:https://github.com/zhedongzheng/finch/blob/master/nlp-models/pytorch/rnn_attn_text_clf.py+&cd=2&hl=en&ct=clnk&gl=uk

def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, dropouth=0.5, dropouti=0.5, dropoute=0.1, wdrop=0, tie_weights=False):
    super(RNNModel, self).__init__()
    self.encoder = nn.Embedding(ntoken, ninp)
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
    for rnn in self.rnns:
        rnn.linear = WeightDrop(rnn.linear, ['weight'], dropout=wdrop)
    self.rnns = torch.nn.ModuleList(self.rnns)
    self.attn_fc = torch.nn.Linear(ninp, 1)
    self.decoder = nn.Linear(nhid, ntoken)

    self.init_weights()

def attention(self, rnn_out, state):
    state = torch.transpose(state, 1,2)
    weights = torch.bmm(rnn_out, state)# torch.bmm(rnn_out, state)
    weights = torch.nn.functional.softmax(weights)#.squeeze(2)).unsqueeze(2)
    rnn_out_t = torch.transpose(rnn_out, 1, 2)
    bmmed = torch.bmm(rnn_out_t, weights)
    bmmed = bmmed.squeeze(2)
    return bmmed

def forward(self, input, hidden, return_h=False, decoder=False, encoder_outputs=None):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)

    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        temp = []
        for item in emb:
            item = item.unsqueeze(0)
            raw_output, new_h = rnn(item, hidden[l])

            raw_output = self.attention(raw_output, new_h[0])

            temp.append(raw_output)
        raw_output = torch.stack(temp)
        raw_output = raw_output.squeeze(1)

        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden

    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)

    outputs = torch.stack(outputs).squeeze(0)
    outputs = torch.transpose(outputs, 2,1)
    output = output.transpose(2,1)
    output = output.contiguous()
    decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
    result = decoded.view(output.size(0), output.size(1), decoded.size(1))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden

This model is training, but my loss is quite high compared to the model without attention.

Answer

I understood your question but it is a bit tough to follow your code and find the reason why the loss is not decreasing. Also, it is not clear why you want to compare the last hidden state of the RNN with all the hidden states at every time step.

Please note that a particular trick/mechanism is only useful if you apply it in the right way. I am not sure the way you are trying to use the attention mechanism is correct, so don't expect good results just because you added attention to your model. You should ask yourself why an attention mechanism would benefit your desired task.

You didn't clearly mention what task you are targeting. Since you have pointed to a repository that contains language-modeling code, I am guessing the task is: given a sequence of tokens, predict the next token.

One possible problem I can see in your code: in the for item in emb: loop, you always use the embeddings as input to each LSTM layer, so having a stacked LSTM doesn't make sense to me.
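
For illustration, here is a minimal sketch of how the layer loop could pass each layer's output on to the next layer instead of re-feeding the embeddings. It reuses the names from your code (emb, self.rnns, hidden) and is only an assumption about how your forward pass is organized:

# hypothetical sketch: each LSTM layer consumes the previous layer's output
layer_input = emb                          # embeddings feed only the first layer
for l, rnn in enumerate(self.rnns):
    raw_output, new_h = rnn(layer_input, hidden[l])
    layer_input = raw_output               # the next layer sees this layer's output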

Now, let me first answer your question and then show step by step how you can build your desired NN architecture.

Do I need to use an encoder-decoder architecture to use an attention mechanism?

The encoder-decoder architecture is better known as sequence-to-sequence learning, and it is widely used in many generation tasks, for example machine translation. The answer to your question is no, you are not required to use any specific neural network architecture to use an attention mechanism.

The structure you presented in the figure is a little ambiguous but should be easy to implement. Since your implementation is not clear to me, I am trying to guide you toward a better way of implementing it. For the following discussion, I am assuming we are dealing with text inputs.

Let's say we have an input of shape 16 x 10, where 16 is batch_size and 10 is seq_len. We can assume we have 16 sentences in a mini-batch and each sentence is of length 10.
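
For completeness, the snippets below assume the following imports (they are not shown in the original answer and use the older, Variable-based PyTorch API):

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable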

batch_size, vocab_size = 16, 100
mat = np.random.randint(vocab_size, size=(batch_size, 10))
input_var = Variable(torch.from_numpy(mat))

Here, 100 can be considered the vocabulary size. It is important to note that, throughout the example I am providing, I assume batch_size is the first dimension of all respective tensors/variables.

Now, let's embed the input variable.

embedding = nn.Embedding(100, 50)
embed = embedding(input_var)

After embedding, we get a variable of shape 16 x 10 x 50, where 50 is the embedding size.
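
A quick sanity check of that shape (the comment shows the expected result for the sizes used above):

print(embed.size())  # expected: torch.Size([16, 10, 50])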

Now, let's define a 2-layer unidirectional LSTM with 100 hidden units at each layer.

rnns = nn.ModuleList()
nlayers, input_size, hidden_size = 2, 50, 100
for i in range(nlayers):
    # the first layer takes the embeddings; later layers take the previous layer's output
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1, batch_first=True))

Then, we can feed our input to this 2-layer LSTM to get the output.

sent_variable = embed
outputs, hid = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)  # dropout between layers
    output, hidden = rnns[i](sent_variable)  # output: batch_size x seq_len x hidden_size
    outputs.append(output)
    hid.append(hidden[0].squeeze(0))  # last hidden state h_n: batch_size x hidden_size
    sent_variable = output  # feed this layer's output to the next layer

rnn_out = torch.cat(outputs, 2)  # batch_size x seq_len x (nlayers*hidden_size)
hid = torch.cat(hid, 1)  # batch_size x (nlayers*hidden_size)

Now, you can simply use hid to predict the next word, and I would suggest you do that. Here, the shape of hid is batch_size x (num_layers*hidden_size).
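
For example, a minimal sketch of that suggestion; the decoder layer below (decoder_no_attn) is only an illustration and is not part of the original answer:

# hypothetical: predict the next word directly from the concatenated last hidden states
decoder_no_attn = nn.Linear(nlayers * hidden_size, vocab_size)  # 200 -> 100 in this example
logits = decoder_no_attn(hid)  # batch_size x vocab_size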

But since you want to use attention to compute a soft alignment score between the last hidden state and each hidden state produced by the LSTM layers, let's do that.

sent_variable = embed
hid, con = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    sent_variable = output

    hidden = hidden[0].squeeze(0)  # batch_size x hidden_size
    hid.append(hidden)
    # score every earlier time step against the last hidden state
    weights = torch.bmm(output[:, 0:-1, :], hidden.unsqueeze(2)).squeeze(2)
    soft_weights = F.softmax(weights, 1)  # batch_size x (seq_len-1)
    # context vector: weighted sum of the earlier hidden states
    context = torch.bmm(output[:, 0:-1, :].transpose(1, 2), soft_weights.unsqueeze(2)).squeeze(2)
    con.append(context)

hid, con = torch.cat(hid, 1), torch.cat(con, 1)
combined = torch.cat((hid, con), 1)

Here, we compute a soft alignment score between the last state and the states at each time step. Then we compute a context vector, which is just a linear combination of all the hidden states. Finally, we combine them to form a single representation.

Please note that I have removed the last hidden state from output (output[:, 0:-1, :]) since you are comparing against the last hidden state itself.

The final combined representation stores the last hidden states and context vectors produced at each layer. You can directly use this representation to predict the next word.
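
As a quick sanity check on the shapes from the snippet above (purely illustrative, using the sizes of this running example: batch_size=16, nlayers=2, hidden_size=100):

assert hid.size() == (16, 200)  # nlayers * hidden_size last hidden states
assert con.size() == (16, 200)  # nlayers * hidden_size context vectors
assert combined.size() == (16, 400)  # matches the decoder input size used below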

Predicting the next word is straightforward, and a simple linear layer, as you are using, is just fine.

Edit: We can do the following to predict the next word.

decoder = nn.Linear(nlayers * hidden_size * 2, vocab_size)
dec_out = decoder(combined)

Here, the shape of dec_out is batch_size x vocab_size. Now we can compute the negative log-likelihood loss, which will be used for backpropagation later.

Before computing the negative log-likelihood loss, we need to apply log_softmax to the output of the decoder.

dec_out = F.log_softmax(dec_out, 1)
target = np.random.randint(vocab_size, size=(batch_size))
target = Variable(torch.from_numpy(target))

We also define the target, which is required to compute the loss. See NLLLoss for details. So, now we can compute the loss as follows.

criterion = nn.NLLLoss()
loss = criterion(dec_out, target)
print(loss)

The printed loss value is:

Variable containing:
 4.6278
[torch.FloatTensor of size 1]

Hope the entire explanation helps you!!
