Saving and loading a trained GradientBoostingClassifier using joblib.dump


Problem description

I am trying to save a trained GradientBoostingClassifier with joblib.dump, using the following code:

# use 90% of training data
NI=int(len(X_tr)*0.9) 
I1=np.random.choice(len(X_tr),NI)
Xi=X_tr[I1,:]
Yi=Y_tr[I1]

# train a GradientBoostingClassifier using that data

a=GradientBoostingClassifier(learning_rate=0.02, n_estimators=500, min_samples_leaf=50,presort=True,warm_start=True)

a.fit(Xi,Yi)

# calculate class probabilities for the remaining data

I2=np.array(list(set(range(len(X_tr)))-set(I1)))
Pi=np.zeros(len(X_tr))
Pi[I2]=a.predict_proba(X_tr[I2,:])[:,1].reshape(-1)

# save indexes of the held-out data and the predicted probabilities
np.savetxt('models\\balanced\\GBT1\\oob_index'+str(j)+'.txt',I2)
np.savetxt('models\\balanced\\GBT1\\oob_m'+str(j)+'.txt',Pi)

# save the trained classifier
joblib.dump(a, 'models\\balanced\\GBT1\\m'+str(j)+'.pkl') 

Once the classifier was trained and saved, I closed the terminal, opened a new one, and ran the following code to load the classifier and test it on the saved test dataset:

    # load the saved class probabilities 
    Pi=np.loadtxt('models\\balanced\\GBT1\\oob_m'+str(j)+'.txt') 

    #load the training data index 
    Ii=np.loadtxt('models\\balanced\\GBT1\\oob_index'+str(j)+'.txt')

    #load the trained model
    a=joblib.load('models\\balanced\\GBT1\\m'+str(j)+'.pkl')

    #predict class probabilities using the trained model
    Pi1=a.predict_proba(X_tr[Ii,:])[:,1] 

    # calculate AUPR for the reloaded model
    _prec,_rec,_=metrics.precision_recall_curve(Y[Ii],Pi1,pos_label=1)
    auc=metrics.auc(_rec,_prec)

    # calculate AUPR for the saved probabilities
    _prec1,_rec1,_=metrics.precision_recall_curve(Y[Ii],Pi[Ii],pos_label=1)
    auc1=metrics.auc(_rec1,_prec1)

    print('in iteration ', j, ' aucs: ', auc, auc1)

The code prints the following: in iteration 0 aucs: 0.0331879 0.0657821 ... In all cases, the AUPR for the reloaded classifier is significantly different from that of the originally trained classifier. I am using the same versions of sklearn and Python for saving and loading. What am I doing wrong?

Recommended answer

The error is in your code, not in joblib. I advise you to split your data using train_test_split, which shuffles the data by default.
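One thing train_test_split avoids is managing index arrays by hand. The sketch below is not taken from the question or the answer: the dataset size, seed, and file name are hypothetical. It illustrates two properties of the manual approach: np.random.choice samples with replacement unless replace=False is passed, and np.loadtxt returns floats unless a dtype is given. The source does not establish whether either of these accounts for the discrepancy reported above.

```python
import os
import tempfile

import numpy as np

rng = np.random.RandomState(0)   # fixed seed, assumed here for repeatability
n_rows = 1000                    # hypothetical dataset size
n_train = int(n_rows * 0.9)

# Default replace=True: the same row can be drawn more than once.
with_repl = rng.choice(n_rows, n_train)
# replace=False: every drawn row is distinct.
without_repl = rng.choice(n_rows, n_train, replace=False)

n_unique_with = len(np.unique(with_repl))
n_unique_without = len(np.unique(without_repl))

# Held-out rows are the ones never drawn for training.
held_out = np.setdiff1d(np.arange(n_rows), without_repl)

# Round-trip the held-out indices through a text file.
path = os.path.join(tempfile.mkdtemp(), "oob_index.txt")  # hypothetical name
np.savetxt(path, held_out, fmt="%d")
as_float = np.loadtxt(path)             # default dtype is float
as_int = np.loadtxt(path, dtype=int)    # usable directly as an index array

print("unique rows with replacement:   ", n_unique_with)
print("unique rows without replacement:", n_unique_without)
print("dtype after default loadtxt:    ", as_float.dtype)
```

Loading with dtype=int (or calling .astype(int) on the result) keeps the indices usable in expressions such as X_tr[Ii, :].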

The following complete example produces the same auc value before and after the model is dumped and reloaded:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pickle
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

def main():
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.3)

    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train)

    preds = clf.predict(X_test)
    prec, rec, _ = precision_recall_curve(y_test, preds, pos_label=1)

    # save with joblib.dump so that it pairs with joblib.load below
    joblib.dump(clf, 'dump.pkl')

    print('AUC SCORE: ', auc(rec, prec))

    clf2 = joblib.load('dump.pkl')
    preds2 = clf2.predict(X_test)

    prec2, rec2, _ = precision_recall_curve(y_test, preds2, pos_label=1)

    print('AUC SCORE AFTER DUMP: ', auc(rec2, prec2))

if __name__ == '__main__':
    main()


>>> AUC SCORE: 0.273271889401
>>> AUC SCORE AFTER DUMP: 0.273271889401
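
A more direct check than comparing a summary metric is to compare the predicted probabilities themselves before and after the round trip. The following is a minimal sketch along the lines of the example above, assuming a recent scikit-learn and the standalone joblib package. The temporary file name, the fixed random_state values, and the stratified split are choices made for this illustration, not part of the original answer.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y keeps all three classes in the training split (an assumption
# added for this sketch; the answer above uses a plain shuffled split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=0
)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
proba_before = clf.predict_proba(X_test)

# Dump to a temporary location and load the model back.
model_path = os.path.join(tempfile.mkdtemp(), "gbt.joblib")  # hypothetical name
joblib.dump(clf, model_path)
clf_loaded = joblib.load(model_path)
proba_after = clf_loaded.predict_proba(X_test)

# The reloaded model should reproduce the original probabilities.
same = np.allclose(proba_before, proba_after)
print("probabilities match after reload:", same)
```

If this check passes while a downstream metric still differs, the difference most likely comes from the data or indices fed to the reloaded model rather than from serialization.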
