How to correct unstable loss and accuracy during training? (binary classification)

Problem description

I am currently working on a small binary classification project using the new keras API in tensorflow. The problem is a simplified version of the Higgs Boson challenge posted on Kaggle.com a few years back. The dataset shape is 2000x14, where the first 13 elements of each row form the input vector, and the 14th element is the corresponding label. Here is a sample of said dataset:

86.043,52.881,61.231,95.475,0.273,77.169,-0.015,1.856,32.636,202.068, 2.432,-0.419,0.0,0
138.149,69.197,58.607,129.848,0.941,120.276,3.811,1.886,71.435,384.916,2.447,1.408,0.0,1
137.457,3.018,74.670,81.705,5.954,775.772,-8.854,2.625,1.942,157.231,1.193,0.873,0.824,1

I am relatively new to machine learning and tensorflow, but I am familiar with the higher-level concepts such as loss functions, optimizers and activation functions. I have tried building various models inspired by examples of binary classification problems found online, but I am having difficulties with training the model. During training, the loss sometimes increases within the same epoch, leading to unstable learning. The accuracy hits a plateau around 70%. I have tried changing the learning rate and other hyperparameters, but to no avail. In comparison, I have hard-coded a fully-connected feed-forward neural net that reaches around 80-85% accuracy on the same problem.

Here is my current model:

import tensorflow as tf
from tensorflow.python.keras.layers.core import Dense
import numpy as np
import pandas as pd

def normalize(array):
    # scales each row (sample) to unit L2 norm -- note this is per-sample, not per-feature
    return array/np.linalg.norm(array, ord=2, axis=1, keepdims=True)

x_train = pd.read_csv('data/labeled.csv', sep='\s+').iloc[:1800, :-1].values
y_train = pd.read_csv('data/labeled.csv', sep='\s+').iloc[:1800, -1:].values

x_test = pd.read_csv('data/labeled.csv', sep='\s+').iloc[1800:, :-1].values
y_test = pd.read_csv('data/labeled.csv', sep='\s+').iloc[1800:, -1:].values

x_train = normalize(x_train)
x_test = normalize(x_test)

model = tf.keras.Sequential()
model.add(Dense(9, input_dim=13, activation=tf.nn.sigmoid))
model.add(Dense(6, activation=tf.nn.sigmoid))
model.add(Dense(1, activation=tf.nn.sigmoid))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=50)
model.evaluate(x_test, y_test)

As mentioned, some of the epochs start with a higher accuracy than they finish with, leading to unstable learning.

  32/1800 [..............................] - ETA: 0s - loss: 0.6830 - acc: 0.5938
1152/1800 [==================>...........] - ETA: 0s - loss: 0.6175 - acc: 0.6727
1800/1800 [==============================] - 0s 52us/step - loss: 0.6098 - acc: 0.6861
Epoch 54/250

  32/1800 [..............................] - ETA: 0s - loss: 0.5195 - acc: 0.8125
1376/1800 [=====================>........] - ETA: 0s - loss: 0.6224 - acc: 0.6672
1800/1800 [==============================] - 0s 43us/step - loss: 0.6091 - acc: 0.6850
Epoch 55/250

What could be the cause of these oscillations in learning in such a simple model? Thanks

I have followed some suggestions from the comments and have modified the model accordingly. It now looks more like this:

from tensorflow.keras.layers import Dense, Dropout

model = tf.keras.Sequential()
model.add(Dense(250, input_dim=13, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(200, activation=tf.nn.relu))
model.add(Dropout(0.4))
model.add(Dense(100, activation=tf.nn.relu))
model.add(Dropout(0.3))
model.add(Dense(50, activation=tf.nn.relu))
model.add(Dense(1, activation=tf.nn.sigmoid))

model.compile(optimizer='adadelta',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Answer

Oscillations

Those are most definitely connected to the size of your network; each batch coming through changes your neural network considerably as it does not have enough neurons to represent the relationships.

The network fits one batch well, then the weight update for the next batch changes the previously learned connections, effectively "unlearning" them. That is also why the loss is jumpy: the network is struggling to accommodate the task you have given it.

Sigmoid activation and its saturation may be causing you trouble as well (the gradient is squashed into a small region and most gradient updates are zero). Quick fix: use ReLU activation, as described below.

Additionally, a neural network does not care about accuracy, only about minimizing the loss value (which it tries to do most of the time). Say it predicts probabilities [0.55, 0.55, 0.55, 0.55, 0.45] for classes [1, 1, 1, 1, 0]; its accuracy is 100%, but it is pretty uncertain. Now, say the next update pushes the network to the probability predictions [0.8, 0.8, 0.8, 0.8, 0.55]. In that case the loss would drop, but so would the accuracy, from 100% to 80%.
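
To make that concrete, here is a tiny numeric check (plain numpy, using the two hypothetical prediction vectors above) showing that the mean binary cross-entropy goes down even though accuracy falls:

import numpy as np

def bce(y_true, y_pred):
    # mean binary cross-entropy over the batch
    return -np.mean(y_true*np.log(y_pred) + (1 - y_true)*np.log(1 - y_pred))

y_true = np.array([1., 1., 1., 1., 0.])
before = np.array([0.55, 0.55, 0.55, 0.55, 0.45])  # all 5 correct at a 0.5 threshold -> 100% accuracy
after = np.array([0.80, 0.80, 0.80, 0.80, 0.55])   # last prediction now wrong -> 80% accuracy

print(bce(y_true, before))  # ~0.60
print(bce(y_true, after))   # ~0.34 -- lower loss despite lower accuracy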

BTW, you may want to check the scores for logistic regression and see how it performs on this task (so a single output layer only).
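
In the same Keras API, a logistic-regression baseline is just a single sigmoid output layer; a minimal sketch, assuming the x_train/y_train/x_test/y_test arrays from the question:

baseline = tf.keras.Sequential()
baseline.add(tf.keras.layers.Dense(1, input_dim=13, activation='sigmoid'))  # single output layer = logistic regression
baseline.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
baseline.fit(x_train, y_train, epochs=50, verbose=0)
print(baseline.evaluate(x_test, y_test))  # [loss, accuracy] on the held-out split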

It is always good to start with a simple model and grow it bigger if needed (I would not advise the other way around). You may want to check, on a really small subsample of the data (say two or three batches, 160 elements or so), whether your model can learn the relationship between input and output at all.
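
One way to run that sanity check (a sketch reusing the model and arrays from the question; the 160-sample subset size is just an assumption):

x_small, y_small = x_train[:160], y_train[:160]   # a handful of 32-sample batches
model.fit(x_small, y_small, epochs=200, verbose=0)
loss, acc = model.evaluate(x_small, y_small)
print(acc)   # if this does not get close to 1.0, the model or the data pipeline needs fixing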

In your case I doubt the model will be able to learn those relationships with the layer sizes you are providing. Try increasing the size, especially in the earlier layers (maybe 50/100 for starters), and see how it behaves.

Sigmoid easily saturates (there is only a small region where changes occur; most values end up almost 0 or 1). It is rarely used nowadays as the activation before the bottleneck (final) layer. Most common nowadays is ReLU, which is not prone to saturation (at least when the input is positive), or one of its variations. This might help as well.

The optimal learning rate is different for each dataset and each neural network model. The defaults usually work so-so, but when the learning rate is too small the network might get stuck in a local minimum (and its generalization will be worse), while a value that is too big will make your network unstable (the loss will oscillate strongly).

You may want to read up on cyclical learning rates (or the original research paper by Leslie N. Smith). There you can find info on how to choose a good learning rate heuristically and how to set up some simple learning rate schedulers. Those techniques were used by the fast.ai team in the CIFAR10 competition with really good results. On their site or in the documentation of their library you can find the One Cycle Policy and a learning rate finder (based on the work of the aforementioned researcher). This should get you started in this realm, I think.
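
For illustration, a bare-bones triangular cyclical schedule can be plugged into Keras through the standard LearningRateScheduler callback; this is only a sketch, and the bounds and step size below are placeholder values you would normally pick with a learning rate range test:

import numpy as np

def triangular_lr(epoch):
    base_lr, max_lr, step_size = 1e-4, 1e-2, 10   # placeholder bounds; one full cycle spans 2*step_size epochs
    cycle = np.floor(1 + epoch / (2 * step_size))
    x = abs(epoch / step_size - 2 * cycle + 1)
    return float(base_lr + (max_lr - base_lr) * max(0.0, 1 - x))

lr_schedule = tf.keras.callbacks.LearningRateScheduler(triangular_lr)
model.fit(x_train, y_train, epochs=250, callbacks=[lr_schedule])   # reuses the model/arrays from the question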

Not sure, but this normalization looks pretty non-standard to me (I have never seen it done like that). Good normalization is the basis for neural network convergence (unless the data is already pretty close to a normal distribution). Usually one subtracts the mean and divides by the standard deviation for each feature. You can check some schemes in the scikit-learn library, for example.
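
For example, per-feature standardization with scikit-learn's StandardScaler (a sketch; the scaler is fit on the training split only so no test statistics leak in):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)   # learn per-feature mean and std on the training data
x_test = scaler.transform(x_test)         # reuse the same statistics for the test data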

This shouldn't be an issue but if your input is complicated you should consider adding more layers to your neural network (right now it's almost definitely too thin). This would allow it to learn more abstract features and transform the input space more.

When the network overfits the data you may employ some regularization techniques (it is hard to tell what might help; you should test it on your own). Some of those include (a rough combined sketch follows the list):

  • A higher learning rate combined with batch normalization, which smooths out the learning space.
  • A smaller number of neurons (the relationships the network learns would intuitively have to be more representative of the data distribution).
  • A smaller batch size, which also has a regularizing effect.
  • Dropout, though it is hard to pin-point a good dropout rate; I would resort to it last. Furthermore, it is known to collide with batch normalization (though there are techniques for combining them; you can find more on the web).
  • L1/L2 regularization, with the second being much more widely applied (unless you have specific knowledge indicating that L1 might perform better).
  • Data augmentation - I would try this one first, mostly out of curiosity. As your features are continuous, you may want to add some random noise, generated from a Gaussian distribution, on a batch-to-batch basis. The noise would have to be small, with a standard deviation around 1e-2 or 1e-3; you would have to test those values experimentally.
  • Early stopping - after N epochs without improvement on the validation set you end the training. A pretty common technique that should be used almost every time. Remember to save the best model on the validation set and to set patience (the N mentioned above) to a moderately sized value (do not set patience to 1 epoch or so; a neural network may easily improve again after 5 or so).
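
As a rough sketch of how a few of these could be combined in Keras (the layer sizes, noise level, L2 weight, and patience below are placeholder values to tune, not recommendations):

from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.GaussianNoise(0.01, input_shape=(13,)),                 # noise augmentation, active only during training
    layers.Dense(100, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(50, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                              restore_best_weights=True)   # keep the best validation weights
model.fit(x_train, y_train, validation_split=0.2, epochs=250, batch_size=32,
          callbacks=[early_stop])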

Plus there are tons of other techniques you may find. Check what makes intuitive sense, pick the one you like the most, and test how it performs.
