TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

Problem description

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is as follows:

import keras

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The network summary is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

This makes sense to me, as my understanding of TimeDistributed is that it applies the same layer at all time points, so the Dense layer has 16*15+15=255 parameters (weights + biases).
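For reference, both figures in the Param # column can be reproduced by hand. The short sketch below is not from the original post and assumes the classic GRU formulation (reset_after=False), which is what the 1536 figure corresponds to:

# Reproducing the Param # column of the summary above.
HiddenSize, InputSize = 16, 15

# Dense / TimeDistributed(Dense): one kernel plus one bias, shared across all timesteps.
dense_params = HiddenSize * InputSize + InputSize  # 16*15 + 15 = 255

# GRU (classic formulation, reset_after=False): three gates, each with an input
# kernel, a recurrent kernel and a bias.
gru_params = 3 * (InputSize * HiddenSize + HiddenSize * HiddenSize + HiddenSize)  # 3*512 = 1536

print(dense_params, gru_params)  # 255 1536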

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

I wonder if this is because Dense() only uses the last dimension of the shape and effectively treats everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
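That intuition can be checked numerically. The following sketch (not part of the original post) emulates with NumPy what a Dense kernel does to a 3D tensor: the dot product contracts only the last axis, so the same kernel is applied independently at every timestep, which is exactly what TimeDistributed(Dense) does explicitly:

import numpy as np

batch, timesteps, hidden, out = 2, 64, 16, 15
x = np.random.random((batch, timesteps, hidden))
kernel = np.random.random((hidden, out))
bias = np.random.random((out,))

# Dense on a 3D input: contract only the last axis of x with the kernel.
dense_out = np.dot(x, kernel) + bias  # shape (2, 64, 15)

# Applying the same kernel timestep by timestep gives an identical result.
per_step = np.stack([np.dot(x[:, t, :], kernel) + bias for t in range(timesteps)], axis=1)
assert np.allclose(dense_out, per_step)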

Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:

def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]

    self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses keras.dot to apply the weights:

def call(self, inputs):
    output = K.dot(inputs, self.kernel)

The documentation of keras.dot implies that it works fine on n-dimensional tensors. I wonder if its exact behaviour means that Dense() will, in effect, be called at every time step. If so, the question remains: what does TimeDistributed() achieve in this case?
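One way to confirm that the two layers behave identically here is to tie their weights and compare outputs. A minimal sketch, assuming a Keras 2-style functional API (the variable names below are illustrative, not from the original post):

import numpy as np
import keras

InputSize, MaxLen, HiddenSize = 15, 64, 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
gru_out = keras.layers.GRU(HiddenSize, return_sequences=True)(inputs)

# The same GRU output fed into a plain Dense and into TimeDistributed(Dense).
m_plain = keras.models.Model(inputs, keras.layers.Dense(InputSize)(gru_out))
m_td = keras.models.Model(inputs, keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(gru_out))

# Copy the Dense kernel and bias so both models share identical weights.
m_td.layers[-1].set_weights(m_plain.layers[-1].get_weights())

x = np.random.random((4, MaxLen, InputSize))
assert np.allclose(m_plain.predict(x), m_td.predict(x), atol=1e-5)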

Recommended answer

TimeDistributed(Dense) applies the same Dense layer to every time step during GRU/LSTM cell unrolling, so the error function is computed between the predicted label sequence and the actual label sequence. (This is normally what you want for sequence-to-sequence labeling problems.)

However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for a classification problem. If return_sequences=True, then the Dense layer is applied at every timestep, just like TimeDistributed(Dense).

So, as far as your models are concerned, both are the same. But if you change your second model to return_sequences=False, the Dense layer will be applied only at the last cell. Try changing it and the model will throw an error, because then Y would need to have size [Batch_size, InputSize]; it is no longer a sequence-to-sequence problem but a full sequence-to-label problem.

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16

OutputSize = 8
n_samples = 1000

# model1: TimeDistributed(Dense) -- the same Dense is applied at every timestep.
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model2: plain Dense after return_sequences=True -- behaves the same as model1.
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# model3: return_sequences=False -- Dense is applied only to the last timestep,
# so the target must have shape (n_samples, OutputSize).
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples,MaxLen,InputSize])
Y1 = np.random.random([n_samples,MaxLen,OutputSize])
Y2 = np.random.random([n_samples, OutputSize])

model1.fit(X, Y1, batch_size=128, epochs=1)  # use nb_epoch=1 on Keras 1.x
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

print(model1.summary())
print(model2.summary())
print(model3.summary())

In the above example, the architectures of model1 and model2 are the same (both sequence-to-sequence models), while model3 is a full sequence-to-label model.
