Incremental training of random forest model using python sklearn


Question


I am using the code below to save a random forest model. I am using cPickle to save the trained model. As I see new data, can I train the model incrementally? Currently, the train set has about 2 years of data. Is there a way to train on another 2 years and (kind of) append it to the existing saved model?

import os
import cPickle  # Python 2; on Python 3 use pickle or joblib instead

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100)
print("Trying to fit the Random Forest model --> ")
if os.path.exists('rf.pkl'):
    print("Trained model already pickled -- >")
    with open('rf.pkl', 'rb') as f:
        rf = cPickle.load(f)
else:
    df_x_train = x_train[col_feature]
    rf.fit(df_x_train, y_train)
    print("Training for the model done ")
    with open('rf.pkl', 'wb') as f:
        cPickle.dump(rf, f)
df_x_test = x_test[col_feature]
pred = rf.predict(df_x_test)


EDIT 1: I don't have the compute capacity to train the model on 4 years of data all at once.

Answer

The sklearn documentation on incremental (out-of-core) learning notes:


Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory.


The docs include a list of classifiers and regressors implementing partial_fit(), but RandomForest is not among them. You can also confirm that RandomForestRegressor does not implement partial_fit on its documentation page.
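You can also check this directly from Python. A minimal sketch, probing the estimator classes for the partial_fit method:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor

# RandomForestRegressor does not expose partial_fit,
# so it cannot be trained incrementally batch by batch.
print(hasattr(RandomForestRegressor, "partial_fit"))  # False

# SGDRegressor does implement partial_fit.
print(hasattr(SGDRegressor, "partial_fit"))  # True
```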

Some possible ways forward:

  • Use a regressor which does implement partial_fit(), such as SGDRegressor
  • Check your RandomForest model's feature_importances_ attribute, then retrain your model on 3 or 4 years of data after dropping unimportant features
  • Train your model on only the most recent two years of data, if you can only use two years
  • Train your model on a random subset drawn from all four years of data
  • Change the max_depth parameter to constrain how complicated your model can get. This saves computation time and so may allow you to use all your data. It can also prevent overfitting. Use cross-validation to select the best tree-depth hyperparameter for your problem
  • Set your RF model's param n_jobs=-1 if you haven't already, to use multiple cores/processors on your machine
  • Use a faster ensemble-tree-based algorithm, such as xgboost
  • Run your model-fitting code on a large machine in the cloud, such as AWS or dominodatalab
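To illustrate the first suggestion, here is a sketch of training SGDRegressor one batch at a time with partial_fit. The data here is synthetic and purely illustrative, standing in for yearly chunks of a real dataset that is too large to load at once:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
sgd = SGDRegressor(random_state=0)

# Simulate four yearly batches; each batch fits in memory on its own.
# Synthetic target: y = 3*x0 - 2*x1 + noise (illustrative only).
for year in range(4):
    X = rng.randn(1000, 2)
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.01 * rng.randn(1000)
    sgd.partial_fit(X, y)  # updates the existing model with this batch only

print(sgd.coef_)  # learned weights should be near [3, -2]
```

Note that SGD is sensitive to feature scaling; on real data you would standardize features first (e.g. with StandardScaler, which also supports partial_fit, so it too can be updated batch by batch).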

