random_state parameter in classification models


Question

Can someone explain why the random_state parameter affects the model so much?

I have a RandomForestClassifier model and want to set the random_state (for reproducibility purposes), but depending on the value I use I get very different values for my overall evaluation metric (F1 score).

For example, I tried to fit the same model with 100 different random_state values, and after training and testing the smallest F1 was 0.64516129 and the largest 0.808823529. That is a huge difference.
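For reference, such an experiment can be reproduced with a sketch like the following (the dataset and train/test split below are placeholders, not the asker's actual setup):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Hypothetical stand-in data; the question does not show the real dataset.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    scores = []
    for seed in range(100):
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test)))

    print(min(scores), max(scores))  # spread of F1 across 100 seeds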

This behaviour also seems to make it very hard to compare two models.

Any ideas?

Answer

If the random_state affects your results, it means that your model has high variance. In the case of Random Forest, this simply means that the forest is too small and you should increase the number of trees (which, due to bagging, reduces variance). In scikit-learn this is controlled by the n_estimators parameter of the constructor.
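A minimal sketch of that fix (same kind of synthetic placeholder data as above; the seed range and forest sizes are arbitrary illustrative choices, and the exact numbers will differ on real data):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    for n_estimators in (1, 10, 100, 1000):
        scores = []
        for seed in range(20):
            clf = RandomForestClassifier(n_estimators=n_estimators,
                                         random_state=seed)
            clf.fit(X_train, y_train)
            scores.append(f1_score(y_test, clf.predict(X_test)))
        # The spread (std) of F1 across seeds should shrink as the forest grows.
        print(n_estimators, np.mean(scores), np.std(scores))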

Why does this happen? Every ML method tries to minimize the error, which from a mathematical perspective can usually be decomposed into bias and variance [+ noise] (see the bias-variance dilemma/tradeoff). Bias is simply how far from the true values your model ends up in expectation - this part of the error usually comes from prior assumptions, such as using a linear model for a nonlinear problem. Variance is how much your results differ when you train on different subsets of the data (or use different hyperparameters - and in the case of randomized methods, the random seed is such a parameter). Hyperparameters are set by us, while parameters are learnt by the model itself during training. Finally, noise is the irreducible error that comes from the problem itself (or from the data representation).

Thus, in your case you have simply encountered a model with high variance; decision trees are well known for their extremely high variance (and small bias). To reduce that variance, Breiman proposed a specific bagging method, known today as Random Forest. The larger the forest, the stronger the variance reduction. In particular, a forest with 1 tree has huge variance, while a forest of 1000 trees is nearly deterministic for moderate-size problems.
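For squared loss, the standard decomposition behind the terms above is:

    E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²

where f̂ is the learned model and σ² is the irreducible noise.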

To sum up, what can you do?

  • Increase the number of trees - this is bound to work, and it is a well-understood and theoretically justified approach
  • Treat random_seed as a hyperparameter during your evaluation, because that is exactly what it is - a piece of meta-knowledge you need to fix up front if you do not wish to increase the size of the forest (see the sketch after this list)
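A sketch of that second option, averaging over seeds so they cannot decide a model comparison (cross_val_score, the seed count, and the two candidate forests are illustrative choices, not prescribed by the answer):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder data; swap in your own X, y.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    def mean_f1_over_seeds(make_model, n_seeds=30):
        # Average F1 over many seeds so a lucky (or unlucky) seed
        # does not decide which model looks better.
        return np.mean([
            cross_val_score(make_model(seed), X, y, cv=5, scoring="f1").mean()
            for seed in range(n_seeds)
        ])

    small = mean_f1_over_seeds(
        lambda s: RandomForestClassifier(n_estimators=10, random_state=s))
    large = mean_f1_over_seeds(
        lambda s: RandomForestClassifier(n_estimators=500, random_state=s))
    print(small, large)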
