Classification results depend on random_state?
Problem description
I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question, but it is not quite the same. As far as I understand from the previous link, the random_state variable described in the documentation is for randomly splitting the data into training and testing sets. So if I understand correctly, my classification results should not depend on the seed. Is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
Recommended answer
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Beyond that, many algorithms in scikit-learn use random_state to select subsets of features, select subsets of samples, determine initial weights, and so on.
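To see the effect of the split seed alone, here is a minimal sketch. It assumes a synthetic dataset from make_classification and a hypothetical helper named score_for_split. The AdaBoost seed is held fixed so that only the train/test split changes between runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (an assumption, not the asker's data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)


def score_for_split(seed):
    """Hypothetical helper: test accuracy when only the split seed varies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # The estimator seed stays fixed, so any change in score comes from the split.
    model = AdaBoostClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# One accuracy value per split seed.
scores = {seed: score_for_split(seed) for seed in range(5)}
for seed, score in scores.items():
    print(f"split seed {seed}: accuracy {score:.3f}")
```

The printed accuracies will typically differ a little from seed to seed, while rerunning with the same seed reproduces the same number. A large spread across seeds usually suggests the test set is too small to give a stable estimate, in which case cross-validation may be a better way to report performance.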
For example:
- Tree-based estimators use random_state for random selection of features and samples (like DecisionTreeClassifier, RandomForestClassifier).
- In clustering estimators like KMeans, random_state is used to initialize the cluster centers.
- SVMs use it for initial probability estimation.
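The estimator-level randomness in the list above can be checked separately from the split. The following sketch assumes synthetic data and a hypothetical helper named forest_probabilities. It keeps the train/test split fixed and varies only the seed passed to RandomForestClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data and one fixed split, so the split is not a source of variation.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def forest_probabilities(seed):
    """Hypothetical helper: class probabilities from a forest built with `seed`."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    forest.fit(X_train, y_train)
    return forest.predict_proba(X_test)


# Same seed twice: the bootstrap samples and feature choices are repeated.
repeatable = np.array_equal(forest_probabilities(0), forest_probabilities(0))
# Different seeds: the trees are grown from different random draws.
seed_sensitive = not np.array_equal(forest_probabilities(0), forest_probabilities(1))

print("same seed gives identical probabilities:", repeatable)
print("different seeds give different probabilities:", seed_sensitive)
```

Fixing random_state on the estimator makes a run reproducible. It does not make one seed more correct than another, so it is usually better to report results over several seeds than to search for the seed with the best score.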
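The documentation says:
If your code depends on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.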
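One way to follow that advice in your own code is sketched below. The function noisy_copy and its parameters are hypothetical names introduced for illustration. sklearn.utils.check_random_state is the scikit-learn utility that turns an int, a RandomState instance, or None into a numpy.random.RandomState object.

```python
import numpy as np
from sklearn.utils import check_random_state


def noisy_copy(values, scale=0.1, random_state=None):
    """Hypothetical helper: add Gaussian noise using a caller-controlled RNG."""
    # Accepts an int seed, an existing RandomState instance, or None.
    rng = check_random_state(random_state)
    values = np.asarray(values, dtype=float)
    # Draw from the local generator instead of calling numpy.random.normal directly.
    return values + rng.normal(loc=0.0, scale=scale, size=values.shape)


a = noisy_copy([1.0, 2.0, 3.0], random_state=42)
b = noisy_copy([1.0, 2.0, 3.0], random_state=42)
print("same seed gives the same noise:", np.array_equal(a, b))
```

Because the generator is built from the argument rather than from NumPy's global state, a test can pass a fixed seed and get the same values on every run.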
Do read the following questions and answers for a better understanding:
- Choosing random_state for sklearn algorithms
- confused about random_state in decision tree of scikit learn