Pre-randomization before random forest training in Scikit-learn


Question

I am getting a surprisingly significant performance boost (+10% cross-validation accuracy) with sklearn.ensemble.RandomForestClassifier just by pre-randomizing the training set. This is very puzzling to me, since (a) RandomForestClassifier supposedly randomizes the training data anyway, and (b) why would the order of the examples matter so much?

Any words of wisdom?

Answer

I ran into the same issue and posted a question, which luckily got resolved.

In my case it was because the data were stored in order, and I was using K-fold cross-validation without shuffling for the train-test split. This means the model was only ever trained on chunks of adjacent samples that shared a certain pattern.
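A quick way to see this, as a minimal sketch on a toy label array I made up for illustration: with shuffle=False, which is KFold's default, the folds come out as contiguous blocks of row indices, so on ordered data each model is fit on an order-dependent chunk of adjacent samples.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)   # 12 samples, stored in order
y = np.array([0] * 6 + [1] * 6)    # labels sorted by class

for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
# The test folds come out as [0..3], [4..7], [8..11]: contiguous blocks
# that follow the storage order, so the class balance of each training
# set depends entirely on how the rows happen to be arranged.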

An extreme example: if you have 50 rows of samples all of class A, followed by 50 rows all of class B, and you manually do a train-test split right in the middle, the model is trained on all the class-A samples but tested on all the class-B samples, so the test accuracy will be 0.
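Here is a minimal sketch of that extreme case (the feature values are invented; only the row order matters):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.vstack([np.random.randn(50, 4) - 2,    # 50 rows of class A
               np.random.randn(50, 4) + 2])   # then 50 rows of class B
y = np.array([0] * 50 + [1] * 50)

X_train, y_train = X[:50], y[:50]   # training set: class A only
X_test, y_test = X[50:], y[50:]     # test set: class B only

clf = RandomForestClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))    # 0.0 -- the model has never seen class B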

In scikit-learn, train_test_split shuffles by default, while the KFold class doesn't. So, depending on your context, you should do one of the following (each sketched after the list):

  • Shuffle the data first
  • Use train_test_split with shuffle=True (again, this is the default)
  • Use KFold and remember to set shuffle=True
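
A minimal sketch of all three options, with placeholder X and y standing in for your data:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.random.randn(100, 4)              # placeholder features
y = np.random.randint(0, 2, size=100)    # placeholder labels

# Option 1: shuffle the data yourself before any splitting
perm = np.random.permutation(len(y))
X, y = X[perm], y[perm]

# Option 2: train_test_split shuffles by default (shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Option 3: KFold does NOT shuffle by default; opt in explicitly
cv = KFold(n_splits=5, shuffle=True, random_state=0)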

