Choosing random_state for sklearn algorithms


Question

I understand that random_state is used in various sklearn algorithms to break ties between different predictors (trees) with the same metric value (for example in GradientBoosting). But the documentation does not clarify or elaborate on this. For example:

1) Where else are these seeds used for random number generation? In RandomForestClassifier, for example, random numbers can be used to pick a random set of features for building a predictor, and algorithms that use subsampling can use random numbers to draw different subsamples. Can/does the same seed (random_state) play a role in multiple random number generations?
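As a minimal sketch of this (on a synthetic data set, purely for illustration): a single random_state seeds all of the estimator's internal randomness, so two forests built with the same seed are identical, while a different seed changes both the bootstrap samples and the feature choices.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Same seed: the forests are built identically, so predictions match exactly.
    a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
    b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
    print((a.predict(X) == b.predict(X)).all())   # True

    # Different seed: different bootstrap samples and feature subsets are drawn,
    # so predictions are not guaranteed to match.
    c = RandomForestClassifier(n_estimators=50, random_state=7).fit(X, y)
    print((a.predict(X) == c.predict(X)).all())   # may be False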

My main concern is:

2) How far-reaching is the effect of this random_state variable? Can its value make a big difference in prediction (classification or regression)? If so, what kinds of data sets should I worry about more? Or is it more about stability than about the quality of the results?
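One way to probe this empirically (an illustrative sketch on synthetic data, not part of the original question) is to refit the same model under several seeds and look at the spread of cross-validated scores; a small standard deviation suggests the seed mostly affects stability rather than quality.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Refit with ten different seeds and summarize the spread of CV scores.
    scores = [
        cross_val_score(GradientBoostingClassifier(random_state=seed), X, y, cv=5).mean()
        for seed in range(10)
    ]
    print("mean=%.3f  std=%.3f" % (np.mean(scores), np.std(scores)))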

3) If it can make a big difference, how best to choose that random_state? It is a difficult parameter to run a GridSearch over without any intuition, especially if the data set is such that a single CV run can take an hour.
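For what it is worth, random_state is an ordinary constructor parameter, so it can be searched like any other hyper-parameter; the sketch below is illustrative only, and the cost concern raised above still applies.

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    # random_state can be placed in param_grid like any other parameter;
    # whether the extra CV cost is worth it is exactly the question's point.
    search = GridSearchCV(
        GradientBoostingClassifier(),
        param_grid={"random_state": [0, 1, 2, 3, 4]},
        cv=3,
    )
    # search.fit(X, y) would then select the best-scoring seed.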

4) If the motive is only to have steady results/evaluations of my models and stable cross-validation scores across repeated runs, does setting random.seed(X) before I use any of the algorithms (with random_state left as None) have the same effect?
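A sketch of what this question is asking about: with random_state=None, sklearn draws from numpy's global RNG (so np.random.seed rather than the stdlib random.seed is the relevant call), and reseeding it immediately before each fit reproduces results, but only as long as no other RNG call happens in between.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    np.random.seed(123)
    a = RandomForestClassifier(n_estimators=50, random_state=None).fit(X, y)

    # Reseed before the second fit; any RNG call between the two seedings
    # would break the reproducibility.
    np.random.seed(123)
    b = RandomForestClassifier(n_estimators=50, random_state=None).fit(X, y)

    print((a.predict(X) == b.predict(X)).all())   # True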

5) Say I am using a random_state value on a GradientBoosted classifier, and I am cross-validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before applying it to the test set. Now, the full training set has more instances than the smaller training sets used in cross-validation, so the random_state value can now result in completely different behavior (choice of features and individual predictors) compared to what was happening within the CV loop. Similarly, settings like min samples leaf can also result in an inferior model, since they were chosen with respect to the number of instances in CV while the actual number of instances is larger. Is this a correct understanding? What is the approach to safeguard against this?
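One possible safeguard for the size-dependent settings mentioned above (an illustrative suggestion, not taken from the original answer) is to express them as fractions rather than absolute counts: in sklearn, a float min_samples_leaf is interpreted as a fraction of n_samples, so it scales automatically when the model is refit on the full training set.

    from sklearn.ensemble import GradientBoostingClassifier

    clf = GradientBoostingClassifier(
        min_samples_leaf=0.01,  # a float is read as a fraction of n_samples
        subsample=0.8,          # the subsampling fraction stays proportional too
        random_state=42,
    )
    # Refitting clf on the full training set keeps these settings proportional
    # to the (larger) number of instances.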

Answer

Yes, the choice of the random seed will impact your prediction results and, as you pointed out in your fourth question, the impact is not really predictable.

The common way to guard against predictions that happen to be good or bad just by chance is to train several models (based on different random states) and to average their predictions in a meaningful way. Similarly, you can see cross-validation as a way to estimate the "true" performance of a model by averaging the performance over multiple training/test data splits.
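A minimal sketch of that averaging idea, assuming a synthetic data set and using predicted-probability averaging as the "meaningful way":

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Train the same model under several random states and average the
    # predicted class probabilities before taking the argmax.
    probas = [
        GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr).predict_proba(X_te)
        for seed in range(5)
    ]
    pred = np.mean(probas, axis=0).argmax(axis=1)
    print("averaged-ensemble accuracy: %.3f" % (pred == y_te).mean())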

