Keras model params are all "NaN"s after reloading

Question

I use transfer learning with ResNet50. I create a new model from the pretrained model provided by Keras (with the 'imagenet' weights).

After training my new model, I save it as follows:

# Save the Siamese Network architecture
siamese_model_json = siamese_network.to_json()
with open("saved_model/siamese_network_arch.json", "w") as json_file:
    json_file.write(siamese_model_json)
# save the Siamese Network model weights
siamese_network.save_weights('saved_model/siamese_model_weights.h5')

Later, I reload it as follows to make some predictions:

# Load the Siamese Network architecture
with open('saved_model/siamese_network_arch.json', 'r') as json_file:
    loaded_model_json = json_file.read()
siamese_network = model_from_json(loaded_model_json)
# load weights into new model
siamese_network.load_weights('saved_model/siamese_model_weights.h5')

Then I check whether the weights look reasonable, as follows (for one of the layers):

print("bn3d_branch2c:\n",
      siamese_network.get_layer('model_1').get_layer('bn3d_branch2c').get_weights())

If I train my network for only 1 epoch, I see reasonable values there.

But if I train my model for 18 epochs (which takes 5-6 hours as I have a very slow computer), I see only NaN values, as follows:

bn3d_branch2c:
 [array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       ...

What is the trick here?

Addendum 1:

Here is how I create my model.

Here, I have a triplet_loss function that I will need later on.

def triplet_loss(inputs, dist='euclidean', margin='maxplus'):
    anchor, positive, negative = inputs
    positive_distance = K.square(anchor - positive)
    negative_distance = K.square(anchor - negative)
    if dist == 'euclidean':
        positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims=True))
        negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims=True))
    elif dist == 'sqeuclidean':
        positive_distance = K.sum(positive_distance, axis=-1, keepdims=True)
        negative_distance = K.sum(negative_distance, axis=-1, keepdims=True)
    loss = positive_distance - negative_distance
    if margin == 'maxplus':
        loss = K.maximum(0.0, 2 + loss)
    elif margin == 'softplus':
        loss = K.log(1 + K.exp(loss))

    returned_loss = K.mean(loss)
    return returned_loss
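
As a small worked example of the default settings above (dist='euclidean', margin='maxplus'), the following plain-Python sketch computes the same quantity for a single triplet without Keras. The helper names and the sample vectors are made up for illustration; only the formula max(0, 2 + d_pos - d_neg) comes from the function above.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def maxplus_triplet_loss(anchor, positive, negative, margin=2.0):
    """max(0, margin + d(anchor, positive) - d(anchor, negative)) for one triplet."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

# Hypothetical unit-length embeddings (illustrative values only).
anchor = [1.0, 0.0]
positive = [0.6, 0.8]
negative = [-1.0, 0.0]

# Positive is close to the anchor and negative is far away: small loss.
loss_good = maxplus_triplet_loss(anchor, positive, negative)
# Swapping the roles puts the "positive" far away: larger loss.
loss_swapped = maxplus_triplet_loss(anchor, negative, positive)
```

Because the embeddings are L2-normalized, two of them can be at most 2 apart, so with a margin of 2 the loss only reaches zero when the positive coincides with the anchor and the negative sits at the opposite point.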

And here is how I construct my model from start to end. I give the complete code to show the exact picture.

model = ResNet50(weights='imagenet')

# Remove the last layer (Needed to later be able to create the Siamese Network model)
model.layers.pop()

# First freeze all layers of ResNet50. Transfer Learning to be applied.
for layer in model.layers:
    layer.trainable = False

# All Batch Normalization layers still need to be trainable so that the "mean"
# and "standard deviation (std)" params can be updated with the new training data
model.get_layer('bn_conv1').trainable = True
model.get_layer('bn2a_branch2a').trainable = True
model.get_layer('bn2a_branch2b').trainable = True
model.get_layer('bn2a_branch2c').trainable = True
model.get_layer('bn2a_branch1').trainable = True
model.get_layer('bn2b_branch2a').trainable = True
model.get_layer('bn2b_branch2b').trainable = True
model.get_layer('bn2b_branch2c').trainable = True
model.get_layer('bn2c_branch2a').trainable = True
model.get_layer('bn2c_branch2b').trainable = True
model.get_layer('bn2c_branch2c').trainable = True
model.get_layer('bn3a_branch2a').trainable = True
model.get_layer('bn3a_branch2b').trainable = True
model.get_layer('bn3a_branch2c').trainable = True
model.get_layer('bn3a_branch1').trainable = True
model.get_layer('bn3b_branch2a').trainable = True
model.get_layer('bn3b_branch2b').trainable = True
model.get_layer('bn3b_branch2c').trainable = True
model.get_layer('bn3c_branch2a').trainable = True
model.get_layer('bn3c_branch2b').trainable = True
model.get_layer('bn3c_branch2c').trainable = True
model.get_layer('bn3d_branch2a').trainable = True
model.get_layer('bn3d_branch2b').trainable = True
model.get_layer('bn3d_branch2c').trainable = True
model.get_layer('bn4a_branch2a').trainable = True
model.get_layer('bn4a_branch2b').trainable = True
model.get_layer('bn4a_branch2c').trainable = True
model.get_layer('bn4a_branch1').trainable = True
model.get_layer('bn4b_branch2a').trainable = True
model.get_layer('bn4b_branch2b').trainable = True
model.get_layer('bn4b_branch2c').trainable = True
model.get_layer('bn4c_branch2a').trainable = True
model.get_layer('bn4c_branch2b').trainable = True
model.get_layer('bn4c_branch2c').trainable = True
model.get_layer('bn4d_branch2a').trainable = True
model.get_layer('bn4d_branch2b').trainable = True
model.get_layer('bn4d_branch2c').trainable = True
model.get_layer('bn4e_branch2a').trainable = True
model.get_layer('bn4e_branch2b').trainable = True
model.get_layer('bn4e_branch2c').trainable = True
model.get_layer('bn4f_branch2a').trainable = True
model.get_layer('bn4f_branch2b').trainable = True
model.get_layer('bn4f_branch2c').trainable = True
model.get_layer('bn5a_branch2a').trainable = True
model.get_layer('bn5a_branch2b').trainable = True
model.get_layer('bn5a_branch2c').trainable = True
model.get_layer('bn5a_branch1').trainable = True
model.get_layer('bn5b_branch2a').trainable = True
model.get_layer('bn5b_branch2b').trainable = True
model.get_layer('bn5b_branch2c').trainable = True
model.get_layer('bn5c_branch2a').trainable = True
model.get_layer('bn5c_branch2b').trainable = True
model.get_layer('bn5c_branch2c').trainable = True

# Used when compiling the siamese network
def identity_loss(y_true, y_pred):
    return K.mean(y_pred - 0 * y_true)  

# Create the siamese network

x = model.get_layer('flatten_1').output # layer 'flatten_1' is the last layer of the model
model_out = Dense(128, activation='relu',  name='model_out')(x)
model_out = Lambda(lambda  x: K.l2_normalize(x,axis=-1))(model_out)

new_model = Model(inputs=model.input, outputs=model_out)

anchor_input = Input(shape=(224, 224, 3), name='anchor_input')
pos_input = Input(shape=(224, 224, 3), name='pos_input')
neg_input = Input(shape=(224, 224, 3), name='neg_input')

encoding_anchor   = new_model(anchor_input)
encoding_pos      = new_model(pos_input)
encoding_neg      = new_model(neg_input)

loss = Lambda(triplet_loss)([encoding_anchor, encoding_pos, encoding_neg])

siamese_network = Model(inputs  = [anchor_input, pos_input, neg_input], 
                        outputs = loss) # Note that the output of the model is the 
                                        # return value from the triplet_loss function above

siamese_network.compile(optimizer=Adam(lr=.0001), loss=identity_loss)

One thing to notice is that I make all batch normalization layers "trainable" so that the BN-related params can be updated with my training data. This creates a lot of lines, but I could not find a shorter solution.
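
One possible way to shorten the long list of get_layer(...) calls is a single loop over the layers. The sketch below assumes, as in the listing above, that every batch normalization layer has a name starting with 'bn'; the helper name unfreeze_batch_norm is made up, and the stand-in objects only exist so the snippet runs without Keras.

```python
from types import SimpleNamespace

def unfreeze_batch_norm(layers, prefix="bn"):
    """Freeze every layer except those whose name starts with `prefix`.

    Returns the names of the layers left trainable.
    """
    unfrozen = []
    for layer in layers:
        layer.trainable = layer.name.startswith(prefix)
        if layer.trainable:
            unfrozen.append(layer.name)
    return unfrozen

# Stand-in objects with the two attributes the helper touches.
# With the real model the call would be: unfreeze_batch_norm(model.layers)
demo_layers = [SimpleNamespace(name=n, trainable=True)
               for n in ("conv1", "bn_conv1", "res2a_branch2a", "bn2a_branch2a")]
unfrozen_names = unfreeze_batch_norm(demo_layers)
```

Newer Keras releases name the ResNet50 layers differently, so matching on the name prefix is tied to the version used here; checking the layer type with isinstance against the BatchNormalization class would be a more robust variant.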

Recommended answer

The solution is inspired by @Gurmeet Singh's recommendation above.

Seemingly, the weights of the trainable layers grew so large after a while during training that they all ended up as NaN. This made me think that I was saving and reloading my model in the wrong way, but the actual problem was exploding gradients.
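
To illustrate how growing weights can end up as NaN, here is a toy sketch in plain Python. It is not the Siamese network from the question: it runs gradient descent on f(w) = w**2, once with a small learning rate and once with one that is too large. All names and values are chosen for illustration only.

```python
import math

def run_gradient_descent(lr, steps=1100, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # standard update rule
    return w

# lr = 0.1 shrinks w by a factor of 0.8 per step, so it converges towards 0.
stable_w = run_gradient_descent(lr=0.1)

# lr = 1.5 doubles |w| on every step; the value overflows to infinity and the
# next update computes inf - inf, which is NaN.
diverged_w = run_gradient_descent(lr=1.5)
```

Once a weight is NaN, every later update keeps it NaN, which matches the observation that a short training run still shows reasonable values while a long one does not.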

I saw a similar issue in a GitHub discussion too, which can be checked out here: github.com/keras-team/keras/issues/2378. At the bottom of that thread, it is recommended to use lower learning rates to avoid the problem.

In this link (Keras ML library: how to do weight clipping after gradient updates? TensorFlow backend), 2 solutions are discussed:
- Using the clipvalue parameter in the optimizer, which simply clips each calculated gradient value to the configured range. But this is not the recommended solution. (Explained in the other thread.)
- Using the clipnorm parameter, which clips the calculated gradients when their L2 norm exceeds the value given by the user.
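
The difference between the two options can be sketched in plain Python. This is a simplified illustration of the two clipping ideas on a single gradient vector, not the Keras implementation; the function names are made up, and whether Keras takes the norm per tensor or over all gradients may depend on the version.

```python
import math

def clip_by_value(grad, clip_value):
    """Clamp every component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]

grad = [3.0, -4.0]                    # L2 norm is 5
by_value = clip_by_value(grad, 1.0)   # components clamped one by one
by_norm = clip_by_norm(grad, 1.0)     # whole vector scaled down to norm 1
```

Clipping by value changes the direction of the gradient (the ratio between components is lost), while clipping by norm keeps the direction and only shortens the step, which is one common argument for preferring clipnorm.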

I also thought about using input normalization (to avoid exploding gradients), but then figured out that it is already done in the preprocess_input(..) function. (Check this link for details: https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/preprocess_input) It is possible, though, to set the mode parameter to "tf" (it is "caffe" by default), which could help further (because the mode="tf" setting scales pixels to between -1 and 1), but I did not try it.
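
To make the difference between the two modes concrete, here is a plain-Python sketch of what each one does to a single RGB pixel. It mirrors the documented behaviour ("tf" scales to [-1, 1]; "caffe" converts RGB to BGR and subtracts the ImageNet channel means without scaling) but is not the library code; the function names are made up.

```python
# Per-channel ImageNet means in BGR order, as used by the "caffe" mode.
IMAGENET_BGR_MEAN = [103.939, 116.779, 123.68]

def preprocess_pixel_tf(rgb):
    """'tf' mode: scale each channel from [0, 255] to [-1, 1]."""
    return [c / 127.5 - 1.0 for c in rgb]

def preprocess_pixel_caffe(rgb):
    """'caffe' mode: reorder RGB to BGR, then subtract the channel means."""
    bgr = rgb[::-1]
    return [c - m for c, m in zip(bgr, IMAGENET_BGR_MEAN)]

white = [255.0, 255.0, 255.0]
tf_white = preprocess_pixel_tf(white)        # every channel ends up at 1.0
caffe_white = preprocess_pixel_caffe(white)  # values stay above 100
```

In "caffe" mode the inputs are zero-centered but keep a range of roughly plus or minus 150, so the "tf" mode feeds noticeably smaller values into the network.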

In summary, I changed 2 things when compiling the model that will be trained:

The changed line is as follows:

Before the change:

siamese_network.compile(optimizer=Adam(lr=.0001),
                        loss=identity_loss)

After the change:

siamese_network.compile(optimizer=Adam(lr=.00004, clipnorm=1.),
                        loss=identity_loss)

1) Used a smaller learning rate to make the gradient updates a bit smaller.
2) Used the clipnorm parameter to clip the calculated gradients based on their norm.

Then I trained my network again for 10 epochs. The loss decreases as desired, but more slowly now. And I no longer experience any problems when saving and reloading my model. (At least after 10 epochs; it takes time on my computer.)
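
Instead of printing a single layer, one could scan every layer for NaN values before saving. The sketch below assumes NumPy is available and that the weights arrive as (name, list of arrays) pairs, which is the shape that layer.name and layer.get_weights() provide in Keras; the helper name find_nan_weights and the demo data are made up.

```python
import numpy as np

def find_nan_weights(named_weights):
    """Return the names of all entries whose weight arrays contain a NaN.

    named_weights: iterable of (name, list_of_arrays) pairs.
    """
    bad = []
    for name, arrays in named_weights:
        if any(np.isnan(np.asarray(a)).any() for a in arrays):
            bad.append(name)
    return bad

# Demo data standing in for real layer weights.
demo = [("bn_ok", [np.ones(3), np.zeros(3)]),
        ("bn_bad", [np.array([1.0, np.nan, 2.0])]),
        ("no_weights", [])]
bad_layers = find_nan_weights(demo)
```

With the model from the question, the input could be built as ((l.name, l.get_weights()) for l in siamese_network.get_layer('model_1').layers); an empty result would suggest that training has not diverged so far.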

Note that I set the value of clipnorm to 1. This means that the L2 norm of the gradient is calculated first, and if that norm exceeds 1, the gradient is clipped (scaled down so that its norm is 1). I assume this is a hyperparameter that can be tuned; it affects the time needed to train the model while helping to avoid the exploding gradients problem.
