Weights of CNN model go to really small values and then NaN


Question

I am not able to understand why the weights of the following model keep getting smaller and smaller during training until they eventually become NaN.

The model is the following:

# Imports assumed from the Keras 2 / TF 1.x API used in the question (not shown in the original post)
import numpy as np
from keras import backend as K
from keras.layers import (Input, Embedding, Convolution1D, GlobalMaxPooling1D,
                          Activation, Dropout, Dense, Reshape, Lambda)
from keras.models import Model
from keras.optimizers import Adam


def initialize_embedding_matrix(embedding_matrix):
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=True)
    return embedding_layer

def get_divisor(x):
    return K.sqrt(K.sum(K.square(x), axis=-1))


def similarity(a, b):
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    denominator = K.maximum(denominator, K.epsilon())
    return numerator / denominator


def max_margin_loss(positive, negative):
    loss_matrix = K.maximum(0.0, 1.0 + negative - Reshape((1,))(positive))
    loss = K.sum(loss_matrix, axis=-1, keepdims=True)
    return loss


def warp_loss(X):
    z, positive_entity, negatives_entities = X
    positiveSim = Lambda(lambda x: similarity(x[0], x[1]), output_shape=(1,), name="positive_sim")([z, positive_entity])
    z_reshaped = Reshape((1, z.shape[1].value))(z)
    # negatives_titles is a global defined elsewhere in the asker's code; its second
    # dimension is the number of negative samples per example
    negativeSim = Lambda(lambda x: similarity(x[0], x[1]), output_shape=(negatives_titles.shape[1].value, 1,), name="negative_sim")([z_reshaped, negatives_entities])
    loss = Lambda(lambda x: max_margin_loss(x[0], x[1]), output_shape=(1,), name="max_margin")([positiveSim, negativeSim])
    return loss

def mean_loss(y_true, y_pred):
    return K.mean(y_pred - 0 * y_true)

def build_nn_model():
    wl, tl = load_vector_lookups()
    embedded_layer_1 = initialize_embedding_matrix(wl)
    embedded_layer_2 = initialize_embedding_matrix(tl)

    sequence_input_1 = Input(shape=(_NUMBER_OF_LENGTH,), dtype='int32',name="text")
    sequence_input_positive = Input(shape=(1,), dtype='int32', name="positive")
    sequence_input_negatives = Input(shape=(10,), dtype='int32', name="negatives")

    embedded_sequences_1 = embedded_layer_1(sequence_input_1)
    embedded_sequences_positive = Reshape((tl.shape[1],))(embedded_layer_2(sequence_input_positive))
    embedded_sequences_negatives = embedded_layer_2(sequence_input_negatives)

    conv_step1 = Convolution1D(
        filters=1000,
        kernel_size=5,
        activation="tanh",
        name="conv_layer_mp",
        padding="valid")(embedded_sequences_1)

    conv_step2 = GlobalMaxPooling1D(name="max_pool_mp")(conv_step1)
    conv_step3 = Activation("tanh")(conv_step2)
    conv_step4 = Dropout(0.2, name="dropout_mp")(conv_step3)
    z = Dense(wl.shape[1], name="predicted_vec")(conv_step4) # activation="linear"

    loss = warp_loss([z, embedded_sequences_positive, embedded_sequences_negatives])
    model = Model(
        inputs=[sequence_input_1, sequence_input_positive, sequence_input_negatives],
        outputs=[loss]
        )
    model.compile(loss=mean_loss, optimizer=Adam())
    return model

model = build_nn_model()
x_train, y_real_train, y_fake_train = load_x_y()
X_train = {
    'text': x_train,
    'positive': y_real_train,
    'negatives': y_fake_train
}

model.fit(x=X_train,  y=np.ones(len(x_train)), batch_size=10, shuffle=True, validation_split=0.1, epochs=10)

To describe the model a bit:

  • I have two pre-trained embeddings (wl, tl) and I initialize the Keras Embedding layers with these values.
  • There are 3 inputs. sequence_input_1 takes integers (indexes of words, e.g. [42, 32, ..., 4]); sequence.pad_sequences(X, maxlen=_NUMBER_OF_LENGTH) is used on them to get a fixed length. sequence_input_positive is the integer index of the positive output, and sequence_input_negatives holds N random negative outputs (10 in the code above) for each example.
  • max_margin_loss measures the difference between cosine_similarity(positive_example, sequence_input_1) and cosine_similarity(negative_example[i], sequence_input_1), and the Adam optimizer is used to minimize the loss (see the numeric sketch right after this list).
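
For a single example, warp_loss / max_margin_loss boils down to sum_i max(0, 1 + cos(z, neg_i) - cos(z, pos)). A plain-numpy sketch with made-up numbers (illustrative only, not part of the original code):

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

z    = np.array([0.5, 0.1])        # predicted vector for one example
pos  = np.array([0.4, 0.2])        # positive entity embedding
negs = np.array([[0.1, 0.9],       # two negative entity embeddings
                 [-0.3, 0.2]])

pos_sim  = cos(z, pos)
neg_sims = np.array([cos(z, n) for n in negs])
loss = np.sum(np.maximum(0.0, 1.0 + neg_sims - pos_sim))
print(loss)  # only the first negative violates the margin here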

While training this model, even with only 20 data points, the weights in the Convolution1D and Dense layers go to NaN. If I add more data points, the embedding weights go to NaN too. I can observe that, as the model runs, the weights get smaller and smaller until they reach NaN. Also noticeable: the loss does not go to NaN. When the weights reach NaN, the loss goes to zero.
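
One way to watch this happening during training is a Keras callback that logs the weight norms after each batch (a minimal sketch; the WeightMonitor class is illustrative and not part of the original code):

from keras.callbacks import Callback
import numpy as np

class WeightMonitor(Callback):
    """Log the norm of every weight tensor after each batch and flag NaNs."""
    def on_batch_end(self, batch, logs=None):
        weights = self.model.get_weights()
        norms = [float(np.linalg.norm(w)) for w in weights]
        print("batch %d, weight norms: %s" % (batch, norms))
        if any(np.isnan(w).any() for w in weights):
            print("NaN detected in the weights at batch %d" % batch)

# model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10, callbacks=[WeightMonitor()], ...)

Keras also ships a TerminateOnNaN callback, but it only watches the loss, which in this case stays finite, so a custom check on the weights is needed.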

I am unable to find what is going wrong.

This is what I have tried so far:

  • I have seen that people use stochastic gradient descent when a hinge loss is used. Switching to the SGD optimizer did not change the behaviour here.
  • Changed the batch size. No change in behaviour.
  • Checked that the input data has no NaN values.
  • Normalized the input matrices (the pre-trained data) for the embeddings with np.linalg.norm (a small helper bundling these checks is sketched after this list).
  • Converted the pre-trained matrices from float64 to float32.
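
The checks on the pre-trained matrices can be bundled into a small helper (a sketch only; check_and_normalize is not part of the original code, and the zero-norm guard is an added assumption):

import numpy as np

def check_and_normalize(matrix):
    # 1. make sure the pre-trained embeddings contain no NaN or inf values
    assert np.isfinite(matrix).all(), "embedding matrix contains NaN or inf"
    # 2. L2-normalize each row; guard against zero rows so the division itself
    #    does not introduce NaN
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    # 3. cast float64 -> float32
    return (matrix / norms).astype(np.float32)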

Do you see anything strange in the architecture of the model? If not: I am unable to find a way to debug the architecture in order to understand why the weights keep getting smaller until they reach NaN. Are there steps people typically take when they notice this kind of behaviour?

Edit:

With trainable=False in the Embeddings this NaN-weight behaviour is NOT observed, and training seems to run smoothly. However, I want the embeddings to be trainable. So why does this behaviour appear when the embeddings are trainable?

Edit 2:

With trainable=True and the weights initialized uniformly at random (embeddings_initializer='uniform'), training is smooth. So the cause is my word embeddings. I have checked my pre-trained word embeddings and there are no NaN values. I also normalized them in case that was the cause, but no luck. I can't think of anything else that would make these specific weights produce this behaviour.

Edit 3:

It seems that what was causing this is that many rows of one of the embedding matrices trained with gensim are all zeros, e.g.:

[0.2, 0.1, .. 0.3],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.2, 0.1, .. 0.1]

It was not easy to spot because the dimensionality of the embeddings is really large.
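
A quick way to locate such rows (a sketch; emb stands in for whichever pre-trained matrix, wl or tl, is being inspected):

import numpy as np

# indices of rows that contain no non-zero entry at all
zero_rows = np.where(~emb.any(axis=1))[0]
print("%d all-zero rows: %s" % (len(zero_rows), zero_rows))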

I am leaving this question open in case someone comes up with something similar or wants to answer the question asked above: "Are there steps people typically take when they notice this kind of behaviour?"

Answer

Your edits made it a little easier to find the problem.

Those zeros pass unchanged into the warp_loss function. The part that goes through the convolution stays unchanged at first, because any filter multiplied by zero gives zero, and the default bias initializer is also 'zeros'. The same idea applies to the Dense layer (weights * 0 = 0, and bias_initializer='zeros').

They then reach this line, return numerator / denominator, and cause the error (a division by zero).

A common practice I have seen in a lot of code is to add K.epsilon() to avoid this:

return numerator / (denominator + K.epsilon())
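
As a quick numeric check (plain numpy, not the actual Keras graph), the epsilon keeps the result finite when one of the vectors is an all-zero embedding row:

import numpy as np

eps = 1e-7                                   # stand-in for K.epsilon()
a = np.zeros(5)                              # an all-zero embedding row
b = np.array([0.2, 0.1, 0.3, 0.0, 0.1])

numerator = np.sum(a * b)                                        # 0.0
denominator = np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2))  # 0.0
print(numerator / (denominator + eps))       # prints 0.0 instead of nan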

