LSTM RNN Backpropagation


Question

Could someone give a clear explanation of backpropagation for LSTM RNNs? This is the type of structure I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of calculating the error of the hypothesis and output, used for adjusting the weights of neural networks. My question is how LSTM backpropagation is different from regular neural networks.

I am unsure of how to find the initial error of each gate. Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am unsure how the cell state plays a role in the backprop of LSTMs, if it does at all. I have looked thoroughly for a good source on LSTMs but have yet to find any.

Solution

That's a good question. You certainly should take a look at the suggested posts for details, but a complete example here would be helpful too.

RNN Backpropagation

I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.

When it comes to backpropagation, the key idea is network unrolling, which is a way to transform the recursion in an RNN into a feed-forward sequence (like in the picture above). Note that the abstract RNN is unbounded (the recursion can run for arbitrarily many steps), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network, with a few complications, e.g. the weights in different layers are shared.

Let's take a look at a classic example, char-rnn by Andrej Karpathy. Here each RNN cell produces two outputs h[t] (the state which is fed into the next cell) and y[t] (the output on this step) by the following formulas, where Wxh, Whh and Why are the shared parameters:
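
Written out in the same notation as the code below, each cell computes

h[t] = tanh(Wxh · x[t] + Whh · h[t-1] + bh)
y[t] = Why · h[t] + by

where x[t] is the 1-of-k (one-hot) encoding of the input character at step t.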

In the code, it's simply three matrices and two bias vectors:

import numpy as np

# hidden_size and vocab_size are assumed to be defined earlier
# (in the original min-char-rnn script, hidden_size = 100 and vocab_size is the number of distinct characters)
# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

The forward pass is pretty straightforward; this example uses softmax and cross-entropy loss. Note that each iteration uses the same W* and h* arrays, but the output and hidden state are different:

# forward pass
xs, hs, ys, ps = {}, {}, {}, {}  # per-step inputs, hidden states, outputs, probabilities
hs[-1] = np.copy(hprev)          # hprev is the hidden state carried over from the previous chunk
loss = 0
for t in xrange(len(inputs)):
  xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
  xs[t][inputs[t]] = 1
  hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
  ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
  ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
  loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)

Now, the backward pass is performed exactly as if it were a feed-forward network, but the gradients of the W* and h* arrays accumulate the contributions from all cells:

# initialize gradient accumulators and the hidden-state gradient carried between steps
dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
dbh, dby = np.zeros_like(bh), np.zeros_like(by)
dhnext = np.zeros_like(hs[0])
for t in reversed(xrange(len(inputs))):
  dy = np.copy(ps[t])
  dy[targets[t]] -= 1
  dWhy += np.dot(dy, hs[t].T)
  dby += dy
  dh = np.dot(Why.T, dy) + dhnext # backprop into h
  dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
  dbh += dhraw
  dWxh += np.dot(dhraw, xs[t].T)
  dWhh += np.dot(dhraw, hs[t-1].T)
  dhnext = np.dot(Whh.T, dhraw)
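
The original min-char-rnn script also clips the accumulated gradients elementwise before the parameter update, which is a common way to tame exploding gradients in BPTT:

# clip gradients elementwise to mitigate exploding gradients
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
  np.clip(dparam, -5, 5, out=dparam)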

Both passes above are done in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it bigger to capture longer dependencies in the input, but you pay for it by storing all outputs and gradients for each cell.
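
To make the chunking concrete, here is a minimal sketch of such a training loop. The names data, char_to_ix, seq_length and lossFun are assumptions: data is the character stream, char_to_ix maps characters to indices, seq_length is the chunk size, and lossFun is a hypothetical wrapper around the forward and backward passes above.

# slide over the data in chunks of seq_length characters,
# carrying the last hidden state of one chunk into the next
p = 0
hprev = np.zeros((hidden_size, 1))  # hidden state carried across chunks
while p + seq_length + 1 <= len(data):
  inputs  = [char_to_ix[ch] for ch in data[p:p + seq_length]]
  targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
  loss, grads, hprev = lossFun(inputs, targets, hprev)  # forward + backward over one chunk
  # ... parameter update using grads goes here ...
  p += seq_length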

What's different in LSTMs

The LSTM picture and formulas look intimidating, but once you have coded a plain vanilla RNN, the LSTM implementation is pretty much the same. For example, here is the backward pass:

# N, T, D, H (batch size, sequence length, input size, hidden size) are assumed to be defined;
# d_h of shape (N, T, H) holds the upstream gradient for every timestep,
# and cache[t] holds the values saved by the forward pass at step t.
d_x = np.zeros((N, T, D))
d_h0 = np.zeros((N, H))
d_Wx = np.zeros((D, 4 * H))
d_Wh = np.zeros((H, 4 * H))
d_b = np.zeros(4 * H)
d_h_next_t = np.zeros((N, H))
d_c_next_t = np.zeros((N, H))

# Loop over all cells, like before
for t in reversed(xrange(T)):
  d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
  d_c_next_t = d_c_prev_t
  d_h_next_t = d_h_prev_t

  d_x[:,t,:] = d_x_t
  d_h0 = d_h_prev_t
  d_Wx += d_Wx_t
  d_Wh += d_Wh_t
  d_b += d_b_t

# The step in each cell.
# Captures all the LSTM complexity in a few formulas.
def lstm_step_backward(d_next_h, d_next_c, cache):
  """
  Backward pass for a single timestep of an LSTM.

  Inputs:
  - d_next_h: Gradient of the next hidden state, of shape (N, H)
  - d_next_c: Gradient of the next cell state, of shape (N, H)
  - cache: Values from the forward pass

  Returns a tuple of:
  - d_x: Gradient of the input data, of shape (N, D)
  - d_prev_h: Gradient of the previous hidden state, of shape (N, H)
  - d_prev_c: Gradient of the previous cell state, of shape (N, H)
  - d_Wx: Gradient of the input-to-hidden weights, of shape (D, 4H)
  - d_Wh: Gradient of the hidden-to-hidden weights, of shape (H, 4H)
  - d_b: Gradient of the biases, of shape (4H,)
  """
  x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache

  d_z = o * d_next_h
  d_o = z * d_next_h
  d_next_c += (1 - z * z) * d_z

  d_f = d_next_c * prev_c
  d_prev_c = d_next_c * f
  d_i = d_next_c * g
  d_g = d_next_c * i

  d_a_g = (1 - g * g) * d_g
  d_a_o = o * (1 - o) * d_o
  d_a_f = f * (1 - f) * d_f
  d_a_i = i * (1 - i) * d_i
  d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)

  d_prev_h = d_a.dot(Wh.T)
  d_Wh = prev_h.T.dot(d_a)

  d_x = d_a.dot(Wx.T)
  d_Wx = x.T.dot(d_a)

  d_b = np.sum(d_a, axis=0)

  return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
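
For completeness, here is a forward step that is consistent with the cache unpacked above. This is a sketch reconstructed from the backward pass (it is not part of the original answer); the gate order i, f, o, g and the shapes follow the docstring and the concatenation used for d_a.

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
  # x: (N, D), prev_h and prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
  N, H = prev_h.shape
  a = x.dot(Wx) + prev_h.dot(Wh) + b  # pre-activations of all four gates, shape (N, 4H)
  i = sigmoid(a[:, 0:H])              # input gate
  f = sigmoid(a[:, H:2*H])            # forget gate
  o = sigmoid(a[:, 2*H:3*H])          # output gate
  g = np.tanh(a[:, 3*H:4*H])          # candidate cell value
  next_c = f * prev_c + i * g         # new cell state
  z = np.tanh(next_c)
  next_h = o * z                      # new hidden state
  cache = (x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h)
  return next_h, next_c, cache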

Summary

Now, back to your questions.

My question is how LSTM backpropagation is different from regular neural networks

There are shared weights in different layers, and a few additional variables (states) that you need to pay attention to. Other than this, there is no difference at all.

Do you use the first error (calculated by hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?

First up, the loss function is not necessarily L2. In the example above it's a cross-entropy loss, so the initial error signal is its gradient:

# remember that ps is the probability distribution from the forward pass
dy = np.copy(ps[t])  
dy[targets[t]] -= 1

Note that it's the same error signal as in an ordinary feed-forward neural network. If you use an L2 loss, the signal is indeed just the difference between the ground truth and the actual output (up to the sign convention used for the gradient).
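
For comparison, a hypothetical L2 variant of the snippet above, written with the same sign convention as the cross-entropy case (target_onehot is illustrative and not part of the original code):

# gradient of 0.5 * ||ys[t] - target_onehot||**2 with respect to the output
dy = ys[t] - target_onehot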

In the case of an LSTM, it's slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, which means that the error signal of each cell gets accumulated. But once again, if you unroll the LSTM, you'll see a direct correspondence with the network wiring.
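
As an illustration of where that upstream gradient might come from, suppose every hidden state feeds a per-timestep output layer; W_out, b_out and d_y below are hypothetical names (not from the code above), with d_y holding the gradient of the loss with respect to the per-step outputs.

# upstream gradient flowing into the hidden state at every step,
# assuming an output layer y_t = h_t.dot(W_out) + b_out at each timestep
d_h = np.zeros((N, T, H))
for t in xrange(T):
  d_h[:, t, :] = d_y[:, t, :].dot(W_out.T)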
