LSTM with Attention


Problem Description


I am trying to add an attention mechanism to the stacked LSTM implementation at https://github.com/salesforce/awd-lstm-lm

All the examples online use an encoder-decoder architecture, which I do not want to use (do I have to use one for the attention mechanism?).

Basically, I have used https://webcache.googleusercontent.com/search?q=cache:81Q7u36DRPIJ:https://github.com/zhedongzheng/finch/blob/master/nlp-models/pytorch/rnn_attn_text_clf.py+&cd=2&hl=en&ct=clnk&gl=uk

def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, dropouth=0.5, dropouti=0.5, dropoute=0.1, wdrop=0, tie_weights=False):
    super(RNNModel, self).__init__()
    self.encoder = nn.Embedding(ntoken, ninp)
    self.rnns = [torch.nn.LSTM(ninp if l == 0 else nhid, nhid if l != nlayers - 1 else (ninp if tie_weights else nhid), 1, dropout=0) for l in range(nlayers)]
    for rnn in self.rnns:
        rnn.linear = WeightDrop(rnn.linear, ['weight'], dropout=wdrop)
    self.rnns = torch.nn.ModuleList(self.rnns)
    self.attn_fc = torch.nn.Linear(ninp, 1)
    self.decoder = nn.Linear(nhid, ntoken)

    self.init_weights()

def attention(self, rnn_out, state):
    state = torch.transpose(state, 1,2)
    weights = torch.bmm(rnn_out, state)# torch.bmm(rnn_out, state)
    weights = torch.nn.functional.softmax(weights)#.squeeze(2)).unsqueeze(2)
    rnn_out_t = torch.transpose(rnn_out, 1, 2)
    bmmed = torch.bmm(rnn_out_t, weights)
    bmmed = bmmed.squeeze(2)
    return bmmed

def forward(self, input, hidden, return_h=False, decoder=False, encoder_outputs=None):
    emb = embedded_dropout(self.encoder, input, dropout=self.dropoute if self.training else 0)
    emb = self.lockdrop(emb, self.dropouti)

    new_hidden = []
    raw_outputs = []
    outputs = []
    for l, rnn in enumerate(self.rnns):
        temp = []
        for item in emb:
            item = item.unsqueeze(0)
            raw_output, new_h = rnn(item, hidden[l])

            raw_output = self.attention(raw_output, new_h[0])

            temp.append(raw_output)
        raw_output = torch.stack(temp)
        raw_output = raw_output.squeeze(1)

        new_hidden.append(new_h)
        raw_outputs.append(raw_output)
        if l != self.nlayers - 1:
            raw_output = self.lockdrop(raw_output, self.dropouth)
            outputs.append(raw_output)
    hidden = new_hidden

    output = self.lockdrop(raw_output, self.dropout)
    outputs.append(output)

    outputs = torch.stack(outputs).squeeze(0)
    outputs = torch.transpose(outputs, 2,1)
    output = output.transpose(2,1)
    output = output.contiguous()
    decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
    result = decoded.view(output.size(0), output.size(1), decoded.size(1))
    if return_h:
        return result, hidden, raw_outputs, outputs
    return result, hidden

This model is training, but my loss is quite high compared to the model without attention.

Solution

I understand your question, but it is a bit tough to follow your code and find the reason why the loss is not decreasing. Also, it is not clear why you want to compare the last hidden state of the RNN with all the hidden states at every time step.

Please note that a particular trick/mechanism is only useful if you use it in the correct way. The way you are trying to use the attention mechanism, I am not sure it is the correct one. So don't expect that just because you use attention in your model, you will get good results! You should ask yourself: why would an attention mechanism benefit your desired task?


You didn't clearly mention what task you are targeting. Since you have pointed to a repo which contains language modeling code, I am guessing the task is: given a sequence of tokens, predict the next token.

One possible problem I can see in your code: in the for item in emb: loop, you always use the embeddings as input to each LSTM layer, so having a stacked LSTM doesn't make sense to me; each layer should consume the output of the layer below, as in the sketch below.
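To make that concrete, here is a minimal, self-contained sketch (my own illustrative code with dummy data, not taken from the repo) of how a stacked LSTM is usually wired: layer 0 reads the embeddings, and every later layer reads the output of the layer below it.

import torch
import torch.nn as nn
from torch.autograd import Variable

# Illustrative wiring of a stacked LSTM (dummy data, not the repo's code).
emb = Variable(torch.randn(16, 10, 50))     # batch_size x seq_len x emb_size
rnns = nn.ModuleList([nn.LSTM(50 if l == 0 else 100, 100, 1, batch_first=True)
                      for l in range(2)])   # two stacked LSTM layers
layer_input = emb
for rnn in rnns:
    layer_output, _ = rnn(layer_input)      # batch_size x seq_len x 100
    layer_input = layer_output              # the next layer consumes this output, not emb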


Now, let me first answer your question and then show step by step how you can build your desired NN architecture.

Do I need to use an encoder-decoder architecture to use an attention mechanism?

The encoder-decoder architecture is better known as sequence-to-sequence learning, and it is widely used in many generation tasks, for example, machine translation. The answer to your question is no, you are not required to use any specific neural network architecture to use an attention mechanism.


The structure you presented in the figure is a little ambiguous but should be easy to implement. Since your implementation is not clear to me, I am trying to guide you to a better way of implementing it. For the following discussion, I am assuming we are dealing with text inputs.

Let's say we have an input of shape 16 x 10, where 16 is batch_size and 10 is seq_len. We can assume we have 16 sentences in a mini-batch, each of length 10.

import numpy as np
import torch
from torch.autograd import Variable
batch_size, vocab_size = 16, 100
mat = np.random.randint(vocab_size, size=(batch_size, 10))  # random token ids: batch_size x seq_len
input_var = Variable(torch.from_numpy(mat))

Here, 100 can be considered the vocabulary size. It is important to note that throughout the example I am providing, I assume batch_size is the first dimension of all the respective tensors/variables.

Now, let's embed the input variable.

import torch.nn as nn
embedding = nn.Embedding(100, 50)  # vocab_size=100, embedding_size=50
embed = embedding(input_var)       # batch_size x seq_len x 50

After embedding, we get a variable of shape 16 x 10 x 50, where 50 is the embedding size.
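A quick sanity check of that shape, continuing the example:

print(embed.size())  # torch.Size([16, 10, 50]) -> batch_size x seq_len x embedding_size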

Now, let's define a 2-layer unidirectional LSTM with 100 hidden units at each layer.

rnns = nn.ModuleList()
nlayers, input_size, hidden_size = 2, 50, 100
for i in range(nlayers):
    # layer 0 takes the 50-dim embeddings; later layers take the 100-dim output of the previous layer
    input_size = input_size if i == 0 else hidden_size
    rnns.append(nn.LSTM(input_size, hidden_size, 1, batch_first=True))

Then, we can feed our input to this 2-layer LSTM to get the output.

import torch.nn.functional as F

sent_variable = embed
outputs, hid = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)  # output: batch_size x seq_len x hidden_size
    outputs.append(output)
    hid.append(hidden[0].squeeze(0))         # last hidden state: batch_size x hidden_size
    sent_variable = output                   # feed this layer's output to the next layer

rnn_out = torch.cat(outputs, 2)              # batch_size x seq_len x (nlayers * hidden_size)
hid = torch.cat(hid, 1)                      # batch_size x (nlayers * hidden_size)

Now, you can simply use hid to predict the next word, and I would suggest you do that. Here, the shape of hid is batch_size x (num_layers * hidden_size).
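For example, a minimal sketch of that simpler, no-attention route, continuing the running example (the names simple_decoder and simple_out are my own, for illustration):

# No-attention baseline: predict the next word directly from the concatenated
# last hidden states of the two layers (hid: batch_size x (nlayers * hidden_size)).
simple_decoder = nn.Linear(nlayers * hidden_size, vocab_size)  # 200 -> 100
simple_out = simple_decoder(hid)                               # batch_size x vocab_size
simple_out = F.log_softmax(simple_out, 1)                      # ready for NLLLoss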

But since you want to use attention to compute soft alignment scores between the last hidden state and each hidden state produced by the LSTM layers, let's do that.

sent_variable = embed
hid, con = [], []
for i in range(nlayers):
    if i != 0:
        sent_variable = F.dropout(sent_variable, p=0.3, training=True)
    output, hidden = rnns[i](sent_variable)
    sent_variable = output

    hidden = hidden[0].squeeze(0)  # last hidden state: batch_size x hidden_size
    hid.append(hidden)
    # dot-product score of the last state against every earlier state
    weights = torch.bmm(output[:, 0:-1, :], hidden.unsqueeze(2)).squeeze(2)  # batch_size x (seq_len - 1)
    soft_weights = F.softmax(weights, 1)  # batch_size x (seq_len - 1)
    # context vector: weighted sum of the earlier hidden states
    context = torch.bmm(output[:, 0:-1, :].transpose(1, 2), soft_weights.unsqueeze(2)).squeeze(2)
    con.append(context)  # batch_size x hidden_size

hid, con = torch.cat(hid, 1), torch.cat(con, 1)  # each: batch_size x (nlayers * hidden_size)
combined = torch.cat((hid, con), 1)              # batch_size x (nlayers * hidden_size * 2)

Here, we compute the soft alignment scores between the last state and the states at each earlier time step. Then we compute a context vector, which is just a linear combination of those hidden states weighted by the alignment scores. We concatenate them to form a single representation.
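In equation form (my own notation, not from the original answer), writing $h_T$ for the last hidden state of a layer and $h_1, \ldots, h_{T-1}$ for the earlier states of that layer:

s_t = h_T^{\top} h_t, \qquad
\alpha_t = \frac{\exp(s_t)}{\sum_{k=1}^{T-1} \exp(s_k)}, \qquad
c = \sum_{t=1}^{T-1} \alpha_t \, h_t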

Please note that I have dropped the last time step from output (output[:, 0:-1, :]) since you would otherwise be comparing the last hidden state with itself.

The final combined representation stores the last hidden state and the context vector produced at each layer. You can use this representation directly to predict the next word.

Predicting the next word is straightforward, and a simple linear layer, as you are using, is just fine.


Edit: We can do the following to predict the next word.

decoder = nn.Linear(nlayers * hidden_size * 2, vocab_size)  # 400 -> 100
dec_out = decoder(combined)

Here, the shape of dec_out is batch_size x vocab_size. Now, we can compute the negative log-likelihood loss, which will be used for backpropagation later.

Before computing the negative log-likelihood loss, we need to apply log_softmax to the output of the decoder.

dec_out = F.log_softmax(dec_out, 1)
target = np.random.randint(vocab_size, size=(batch_size))  # random next-word ids, just for this example
target = Variable(torch.from_numpy(target))

We have also defined the target, which is required to compute the loss. See NLLLoss for details. Now we can compute the loss as follows.

criterion = nn.NLLLoss()
loss = criterion(dec_out, target)
print(loss)

The printed loss value is:

Variable containing:
 4.6278
[torch.FloatTensor of size 1]
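If you then want to use this loss to update the parameters, a minimal sketch of one training step would look like the following (the choice of SGD and the learning rate are my own, purely for illustration):

# Hypothetical training step; SGD with lr=0.1 is an arbitrary choice for illustration.
params = list(embedding.parameters()) + list(rnns.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

optimizer.zero_grad()
loss.backward()    # backpropagate through the decoder, attention, LSTM layers and embedding
optimizer.step()   # update all parameters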

Hope the entire explanation helps you!!
