partial_fit with SGDClassifier gives fluctuating accuracy


Question

My data is in a sparse matrix. Before starting the big computation, I am first working on a subset of ~500k rows. The features are bigram counts plus entropy and string length, and the complete dataset contains hundreds of millions of rows by 1400 columns. The model is meant to help characterise these strings, so I use SGDClassifier for logistic regression.

Because of the large size I decided to use partial_fit on my SGDClassifier, but the area-under-curve value I compute at each epoch seems to fluctuate a lot.

Here is my code:

import dill
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# loss='log' and n_iter are the old-sklearn names
# (loss='log_loss' and max_iter in newer versions)
model = SGDClassifier(loss='log', alpha=1e-10, n_iter=50, n_jobs=-1, shuffle=True)
best_auc = 0.0

# file_list, labels and max_epoch are defined elsewhere
for f in file_list:
    data = dill.load(open(f, 'rb'))
    # Split off a test set, then carve a small holdout out of the training part
    X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
    X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.05)
    for ep in range(max_epoch):
        model.partial_fit(X_train, y_train, classes=np.unique(y_train))

        # Calculate area under the ROC curve to see if things improve
        probs = model.predict_proba(X_holdout)
        auc   = roc_auc_score(y_holdout, probs[:, 1])

        if auc > best_auc:
            best_auc = auc
        print('Epoch: %d - auc: %.2f (best %.2f)' % (ep, auc, best_auc))

What happens is that the auc quickly goes up to ~0.9 but then fluctuates a lot. Sometimes it even drops to ~0.5-0.6 and then climbs back up. I thought that, more logically, the auc should generally keep increasing with each epoch, with only small dips possible, until it finds an equilibrium value where more training hardly improves anything.

Is there anything I am doing wrong, or is this possibly "normal" behaviour with partial_fit? I never saw this behaviour when I used fit on the smaller dataset.

Answer

Usually, partial_fit is prone to a reduction or fluctuation in accuracy. To some extent this can be mitigated slightly by shuffling and providing only small fractions of the entire dataset at a time. For larger data, however, online training with SGDClassifier/SVM classifiers often seems to give only declining accuracy.

I tried experimenting with it and discovered that a low learning rate can sometimes help. A rough analogy: repeatedly training the same model on large data leads the model to forget what it learnt from the previous data. Using a tiny learning rate therefore slows down the rate of learning as well as the rate of forgetting.

Rather than providing a rate manually, we can use the adaptive learning-rate functionality provided by sklearn. Notice the model initialisation part:

model = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001,
                      max_iter=3000, tol=None, shuffle=True, verbose=0,
                      learning_rate='adaptive', eta0=0.01,
                      early_stopping=False)

This is described in the scikit-learn docs as:

‘adaptive’: eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
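For a runnable end-to-end sketch of that initialisation, the snippet below trains the adaptive-rate model with fit() on synthetic make_classification data (a made-up stand-in for the real dataset; the hyperparameters are the ones from the initialisation above, and learning_rate='adaptive' requires scikit-learn >= 0.20).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# eta stays at eta0 until training stops improving by tol,
# at which point the documented schedule divides it by 5
model = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001,
                      max_iter=3000, tol=None, shuffle=True, verbose=0,
                      learning_rate='adaptive', eta0=0.01,
                      early_stopping=False, random_state=0)
model.fit(X, y)
acc = accuracy_score(y, model.predict(X))
```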

With the change in learning rate I got really good results: model accuracy went to 100%, where previously it had dropped from an initial 98% to 28% on the fourth part of the dataset.

