MLP with partial_fit() performing worse than with fit() in a supervised classification


Problem description


The learning dataset I'm using is a grayscale image that was flattened so that each pixel represents an individual sample. A second image will be classified pixel by pixel after the Multilayer Perceptron (MLP) classifier has been trained on the former one.

The problem I have is that the MLP performs better when it receives the whole training dataset at once (fit()) than when it is trained in chunks (partial_fit()). I'm keeping the default parameters provided by Scikit-learn in both cases.

I'm asking this question because when the training dataset is on the order of millions of samples, I will have to use partial_fit() to train the MLP in chunks.

import numpy as np
from sklearn.neural_network import MLPClassifier

def batcherator(data, target, chunksize):
    # Yield the training set in consecutive chunks of `chunksize` samples.
    for i in range(0, len(data), chunksize):
        yield data[i:i+chunksize], target[i:i+chunksize]

def classify():
    classifier = MLPClassifier(verbose=True)

    # Training on the whole dataset at once:
    # classifier.fit(training_data, training_target)

    # Training chunk by chunk:
    gen = batcherator(training.data, training.target, 1000)
    for chunk_data, chunk_target in gen:
        classifier.partial_fit(chunk_data, chunk_target,
                               classes=np.array([0, 1]))

    predictions = classifier.predict(test_data)

My question is: which parameters should I adjust in the MLP classifier to make its results more acceptable when it's trained on chunks of data?

I've tried increasing the number of neurons in the hidden layer using hidden_layer_sizes, but I didn't see any improvement. There was no improvement either when I changed the activation function of the hidden layer from the default relu to logistic using the activation parameter.
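For reference, those attempts amount to passing the corresponding constructor arguments, roughly as sketched below (the layer size is an arbitrary placeholder, not a value I'm recommending):

from sklearn.neural_network import MLPClassifier

# More neurons in the (single) hidden layer -- the default is (100,):
classifier = MLPClassifier(hidden_layer_sizes=(500,), verbose=True)

# Logistic activation instead of the default 'relu':
classifier = MLPClassifier(activation='logistic', verbose=True)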

Below are the images I'm working with (all of them are 512x512), together with links to the Google Fusion tables where they were exported as CSV from the numpy arrays (to keep the images as floats instead of ints):

Training_data:

The white areas are masked out: Google Fusion Table (training_data)

Class0:

Class1:

Training_target:

Google Fusion Table (training_target)

Test_data:

Google Fusion Table (test_data)

Prediction (with partial_fit):

Google Fusion Table (predictions)

Solution

TL;DR: make several passes over your data with a small learning rate and a different order of observations, and your partial_fit will perform as well as fit.

The problem with partial_fit with many chunks is that by the time your model completes the last chunk, it may have forgotten the first one. This means that changes in the model weights due to the early batches can be completely overwritten by the late batches.

This problem, however, can be solved easily enough with a combination of:

  1. Low learning rate. If the model learns slowly, then it also forgets slowly, and the early batches will not be overwritten by the late batches. The default learning rate in MLPClassifier is 0.001, but you can change it by factors of 3 or 10 and see what happens.
  2. Multiple epochs. If the learning rate is small, then one pass over all the training samples might not be enough for the model to converge. So you can make several passes over the training data, and the result will most probably improve. An intuitive strategy is to increase the number of passes by the same factor by which you decrease the learning rate.
  3. Shuffling observations. If images of dogs come before images of cats in your data, then in the end the model will remember more about cats than about dogs. If, however, you shuffle your observations somehow in the batch generator, this will not be a problem. The safest strategy is to reshuffle the data anew before each epoch. A minimal sketch combining these three points follows the list.
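Putting the three points together, such a training loop could look like the following sketch. It reuses the question's variable names (training_data, training_target, test_data, assumed to be numpy arrays), and the learning rate, epoch count and chunk size are illustrative assumptions, not tuned values:

import numpy as np
from sklearn.neural_network import MLPClassifier

def batcherator(data, target, chunksize):
    # Yield the training set in consecutive chunks of `chunksize` samples.
    for i in range(0, len(data), chunksize):
        yield data[i:i + chunksize], target[i:i + chunksize]

classifier = MLPClassifier(learning_rate_init=0.0001,  # ~10x smaller than the 0.001 default
                           verbose=True)

n_epochs = 10  # roughly compensates for the smaller learning rate
for epoch in range(n_epochs):
    # Reshuffle before each epoch so late chunks don't systematically
    # overwrite what was learned from the early chunks.
    order = np.random.permutation(len(training_data))
    shuffled_data, shuffled_target = training_data[order], training_target[order]

    for chunk_data, chunk_target in batcherator(shuffled_data, shuffled_target, 1000):
        classifier.partial_fit(chunk_data, chunk_target,
                               classes=np.array([0, 1]))

predictions = classifier.predict(test_data)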
