What is the structure of a Keras model if input_shape is omitted and why does it perform better?

Question

I omitted the input_shape in the first layer of my Keras model by mistake. Eventually I noticed this and fixed it – and my model's performance dropped dramatically.

Looking at the structure of the model with and without input_shape, I discovered that the better-performing model has the output shape of multiple. Moreover, plotting it with plot_model shows no connections between the layers:

When it comes to performance, the model I understand (with input_shape) achieves a validation loss of 4.0513 (MSE) after 10 epochs with my test code (below), while the "weird" model manages 1.3218 – and the difference only increases with more epochs.

Model definition:

model = keras.Sequential()
model.add(keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)))
#                                   add or remove this  ^^^^^^^^^^^^^^^^^^^
model.add(keras.layers.Dropout(0.05))
...

(never mind the details, this is just a model that demonstrates the difference in performance with and without input_shape)

So what is happening in the better-performing model? What is multiple? How are the layers really connected? How could I build this same model while also specifying input_shape?

Complete script:

import tensorflow as tf
from tensorflow import keras
import numpy as np
from collections import deque
import math, random

def func(x):
    return math.sin(x)*5 + math.sin(x*1.8)*4 + math.sin(x/4)*5

def get_data():
    x = 0
    dx = 0.1
    q = deque()
    r = 0
    data = np.zeros((100000, 1002), np.float32)
    while True:
        x = x + dx
        sig = func(x)
        q.append(sig)
        if len(q) < 1000:
            continue

        arr = np.array(q, np.float32)

        for k in range(10):
            xx = random.uniform(0.1, 9.9)
            data[r, :1000] = arr[:1000]
            data[r, 1000] = 5*xx #scale for easier fitting
            data[r, 1001] = func(x + xx)
            r = r + 1
            if r >= data.shape[0]:
                break

        if r >= data.shape[0]:
            break

        q.popleft()

    inputs = data[:, :1001]
    outputs = data[:, 1001]
    return (inputs, outputs)

np.random.seed(1)
tf.set_random_seed(1)
random.seed(1)

model = keras.Sequential()
model.add(keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)))
#                                   add or remove this  ^^^^^^^^^^^^^^^^^^^
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(1))

model.compile(
    loss = 'mse',
    optimizer = tf.train.RMSPropOptimizer(0.0005),
    metrics = ['mae', 'mse'])

inputs, outputs = get_data()

hist = model.fit(inputs, outputs, epochs=10, validation_split=0.1)

print("Final val_loss is", hist.history['val_loss'][-1])

Answer

TL;DR

The results differ because the two models start from different initial weights. The fact that one performs (significantly) better than the other is purely by chance; as @today mentioned, the results they obtain are approximately similar.

As the documentation for tf.set_random_seed explains, random operations use two seeds, the graph-level seed and the operation-specific seed; tf.set_random_seed sets the graph-level seed:

Operations that rely on a random seed actually derive it from two seeds: the graph-level and operation-level seeds. This sets the graph-level seed.

Taking a look at the definition of Dense we see that the default kernel initializer is 'glorot_uniform' (let's only consider the kernel initializer here, but the same holds for the bias initializer). Walking further through the source code we eventually find that this fetches GlorotUniform with default arguments; specifically, the random-number-generator seed for that operation (namely weight initialization) is set to None. Now if we check where this seed is used, we find it is passed, for example, to random_ops.truncated_normal. This in turn (as do all random operations) now fetches the two seeds, one being the graph-level seed and the other the operation-specific seed: seed1, seed2 = random_seed.get_seed(seed). Checking the definition of the get_seed function, we find that if the operation-specific seed is not given (which is our case) then it is derived from properties of the current graph: op_seed = ops.get_default_graph()._last_id. The corresponding part of the tf.set_random_seed docs reads:

  1. If the graph-level seed is set, but the operation seed is not: The system deterministically picks an operation seed in conjunction with the graph-level seed so that it gets a unique random sequence.

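To make this concrete, here is a minimal sketch (assuming TF 1.x, as in the question; not part of the original answer) that calls the internal random_seed.get_seed helper directly. When no operation-level seed is supplied, the returned operation seed is derived from the current graph's _last_id:

# Sketch for TF 1.x; random_seed is an internal module
# (tensorflow/python/framework/random_seed.py).
import tensorflow as tf
from tensorflow.python.framework import random_seed

tf.set_random_seed(1)      # graph-level seed

# Every op added to the graph increments graph._last_id.
a = tf.constant(0.0)
b = tf.constant(1.0)

# No operation-level seed given -> it is derived from the current graph.
graph_seed, op_seed = random_seed.get_seed(None)
print(graph_seed, op_seed)
print(tf.get_default_graph()._last_id)   # op_seed should track this value
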
Now coming back to the original problem: whether or not input_shape is defined makes a difference for the graph structure. Looking again at a bit of source code, we find that Sequential.add builds the inputs and outputs of the network incrementally only if input_shape was specified; otherwise it just stores a list of layers (model._layers); compare model.inputs and model.outputs for the two definitions. The output is built incrementally by calling the layers directly, which dispatches to Layer.__call__. This wrapper builds the layer, sets the layer's inputs and outputs and adds some metadata to the outputs; it also uses an ops.name_scope to group operations. We can see this from the visualization provided by Tensorboard (example for the simplified model architecture of Input -> Dense -> Dropout -> Dense):

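The difference is easy to see by inspecting the two Sequential definitions directly. The following is only a sketch (assuming tf.keras under TF 1.x, with toy layer sizes), not code from the original question:

# With input_shape the graph network is built as layers are added;
# without it, only the layer list is stored until the first call with data.
import tensorflow as tf
from tensorflow import keras

with_shape = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)),
    keras.layers.Dense(1),
])
without_shape = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu),
    keras.layers.Dense(1),
])

print(with_shape.inputs, with_shape.outputs)        # symbolic input/output tensors
print(without_shape.inputs, without_shape.outputs)  # expected: None/empty (deferred build)
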
Now in the case where we didn't specify input_shape, all the model has is a list of layers. Even after having called compile, the model is actually not compiled (only attributes such as the optimizer are set). Instead it is compiled "on the fly" when data is passed into the model for the first time. This happens in model._standardize_weights: the model output is obtained via self.call(dummy_input_values, training=training). Checking this method, we find that it builds the layers (note that the model is not yet built) and then computes the output incrementally by using Layer.call (not __call__). This leaves out all the metadata and also the grouping of operations, and hence results in a different structure of the graph (though its computational operations are all the same). Again checking Tensorboard we find:

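Along the same lines, a small probe (again only a sketch under TF 1.x tf.keras) shows that the deferred model is only built once data is passed in, not when compile is called:

import numpy as np
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu),   # no input_shape
    keras.layers.Dense(1),
])
print(model.built)     # expected: False

model.compile(loss='mse', optimizer=tf.train.RMSPropOptimizer(0.0005))
print(model.built)     # expected: still False -- compile does not build the graph

model.predict(np.zeros((1, 1001), np.float32))
print(model.built)     # expected: True after the first pass of data
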
Expanding both graphs we would find that they contain the same operations, grouped together differently. However, this has the effect that keras.backend.get_session().graph._last_id differs between the two definitions, and hence results in a different seed for the random operations:

# With `input_shape`:
>>> keras.backend.get_session().graph._last_id
303
# Without `input_shape`:
>>> keras.backend.get_session().graph._last_id
7

Performance results

I used the OP's code with some modifications in order to have similar random operations:

  • Added the steps described here to ensure reproducibility in terms of randomization,
  • Set random seeds for Dense and Dropout variable initialization,
  • Removed validation_split since the splitting happens before "on the fly" compilation of the model without input_shape and hence might interfere with the seed,
  • Set shuffle = False since this might use a separate operation specific seed.

This is the complete code (in addition I performed export PYTHONHASHSEED=0 before running the script):

from collections import deque
from functools import partial
import math
import random
import sys
import numpy as np
import tensorflow as tf
from tensorflow import keras


seed = int(sys.argv[1])

np.random.seed(1)
tf.set_random_seed(seed)
random.seed(1)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
keras.backend.set_session(sess)


def func(x):
    return math.sin(x)*5 + math.sin(x*1.8)*4 + math.sin(x/4)*5


def get_data():
    x = 0
    dx = 0.1
    q = deque()
    r = 0
    data = np.zeros((100000, 1002), np.float32)
    while True:
        x = x + dx
        sig = func(x)
        q.append(sig)
        if len(q) < 1000:
            continue

        arr = np.array(q, np.float32)

        for k in range(10):
            xx = random.uniform(0.1, 9.9)
            data[r, :1000] = arr[:1000]
            data[r, 1000] = 5*xx #scale for easier fitting
            data[r, 1001] = func(x + xx)
            r = r + 1
            if r >= data.shape[0]:
                break

        if r >= data.shape[0]:
            break

        q.popleft()

    inputs = data[:, :1001]
    outputs = data[:, 1001]
    return (inputs, outputs)


Dense = partial(keras.layers.Dense, kernel_initializer=keras.initializers.glorot_uniform(seed=1))
Dropout = partial(keras.layers.Dropout, seed=1)

model = keras.Sequential()
model.add(Dense(64, activation=tf.nn.relu,
    # input_shape=(1001,)
))
model.add(Dropout(0.05))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dropout(0.05))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dropout(0.05))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dropout(0.05))
model.add(Dense(1))

model.compile(
    loss = 'mse',
    optimizer = tf.train.RMSPropOptimizer(0.0005)
)

inputs, outputs = get_data()
shuffled = np.arange(len(inputs))
np.random.shuffle(shuffled)
inputs = inputs[shuffled]
outputs = outputs[shuffled]

hist = model.fit(inputs, outputs[:, None], epochs=10, shuffle=False)
np.save('without.{:d}.loss.npy'.format(seed), hist.history['loss'])

With this code I'd actually expect to obtain similar results for both approaches; however, it turns out that they are not equal:

for i in $(seq 1 10)
do
    python run.py $i
done

Plot the mean loss +/- 1 std. dev.:

I verified that the initial weights and an initial prediction (before fitting) are the same for the two versions:

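# Run once per model variant (with / without input_shape), before fitting;
# set `mode` accordingly so the saved files can be compared below.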
inputs, outputs = get_data()

mode = 'without'
pred = model.predict(inputs)
np.save(f'{mode}.prediction.npy', pred)

for i, layer in enumerate(model.layers):
    if isinstance(layer, keras.layers.Dense):
        w, b = layer.get_weights()
        np.save(f'{mode}.{i:d}.kernel.npy', w)
        np.save(f'{mode}.{i:d}.bias.npy', b)

for i in 0 2 4 8
do
    for data in bias kernel
    do
        diff -q "with.$i.$data.npy" "without.$i.$data.npy"
    done
done

Influence of Dropout

[!] I checked the performance after removing all Dropout layers and in that case the performance is actually equal. So the crux seems to lie with the Dropout layers. In fact, the performance of the model without Dropout layers is the same as that of the model with Dropout layers but without input_shape specified. So it seems that without input_shape the Dropout layers are not effective.

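If the Dropout layers really are not effective, they behave as if the training flag were False, in which case Dropout is simply the identity. A minimal, self-contained illustration (a sketch under TF 1.x graph mode, not the OP's model):

import tensorflow as tf
from tensorflow import keras

x = tf.ones((1, 8))
drop = keras.layers.Dropout(0.5, seed=1)
y_train = drop(x, training=True)    # roughly half the entries zeroed, the rest scaled by 2
y_infer = drop(x, training=False)   # identity

with tf.Session() as sess:
    print(sess.run(y_train))
    print(sess.run(y_infer))        # all ones
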
Basically the difference between the two versions is that one uses __call__ and the other uses call to compute the outputs (as explained above). Since the performance is similar to that without Dropout layers, a possible explanation is that the Dropout layers don't drop anything when input_shape is not specified. This could be caused by training=False, i.e. the layers don't recognize that they are in training mode. However I don't see a reason why this would happen. We can also consider the Tensorboard graphs again.

With input_shape specified:

Without input_shape specified:

where the switch also depends on the learning phase (as before):

To verify the training kwarg let's subclass Dropout:

class Dropout(keras.layers.Dropout):
    def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
        super().__init__(rate, noise_shape=noise_shape, seed=1, **kwargs)

    def __call__(self, inputs, *args, **kwargs):
        training = kwargs.get('training')
        if training is None:
            training = keras.backend.learning_phase()
        print('[__call__] training: {}'.format(training))
        return super().__call__(inputs, *args, **kwargs)

    def call(self, inputs, training=None):
        if training is None:
            training = keras.backend.learning_phase()
        print('[call]     training: {}'.format(training))
        return super().call(inputs, training)

I obtain similar outputs for both versions; however, the calls to __call__ are missing when input_shape is not specified:

[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call]     training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call]     training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call]     training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call]     training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)

So I suspect that the problem lies somewhere within __call__ but right now I can't figure out what it is.

I'm using Ubuntu 16.04, Python 3.6.7 and Tensorflow 1.12.0 via conda (no GPU support):

$ uname -a
Linux MyPC 4.4.0-141-generic #167-Ubuntu SMP Wed Dec 5 10:40:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 3.6.7 :: Anaconda, Inc.
$ conda list | grep tensorflow
tensorflow                1.12.0          mkl_py36h69b6ba0_0
tensorflow-base           1.12.0          mkl_py36h3c3e929_0

Edit

I also had keras and keras-base installed (keras-applications and keras-preprocessing are required by tensorflow):

$ conda list | grep keras
keras                     2.2.4                         0  
keras-applications        1.0.6                    py36_0  
keras-base                2.2.4                    py36_0  
keras-preprocessing       1.0.5                    py36_0

After removing all keras* and tensorflow* packages and then reinstalling tensorflow, the discrepancy vanished. Even after reinstalling keras the results remain similar. I also checked a different virtualenv where tensorflow is installed via pip; no discrepancy there either. Right now I can't reproduce this discrepancy anymore. It must have been a broken installation of tensorflow.
