Confused about random_state in decision tree of scikit learn


Question

I am confused about the random_state parameter, and not sure why decision tree training needs some randomness. My thoughts: (1) is it related to random forests? (2) is it related to splitting the training/testing data set? If so, why not use the train/test split method directly (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)?

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])
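
As an aside, the sklearn.cross_validation module quoted above was deprecated in scikit-learn 0.18 and removed in 0.20; a minimal sketch of the same example against the current sklearn.model_selection module:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score  # replaces sklearn.cross_validation
from sklearn.tree import DecisionTreeClassifier

# Same example as above, updated to the current import path.
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
print(cross_val_score(clf, iris.data, iris.target, cv=10))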

Regards, Lin

Answer

This is explained in the documentation:

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

So, basically, a sub-optimal greedy algorithm is repeated a number of times using random selections of features and samples (a technique similar to the one used in random forests). The random_state parameter allows controlling these random choices.
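
To make that concrete, here is a minimal sketch (not from the original answer, using the iris data): when max_features is set below the number of features, the candidate features tried at each split are drawn at random, so the seed can change which tree is learned, while a fixed seed pins it down.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Different seeds: the randomly drawn candidate features may differ,
# so the learned trees may differ in shape.
tree_a = DecisionTreeClassifier(max_features=2, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_features=2, random_state=1).fit(X, y)
print(tree_a.tree_.node_count, tree_b.tree_.node_count)  # may differ

# Same seed: the construction is fully deterministic.
tree_c = DecisionTreeClassifier(max_features=2, random_state=0).fit(X, y)
print(tree_a.tree_.node_count == tree_c.tree_.node_count)  # True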

The interface documentation specifically states:

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
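
For illustration, a short sketch of those three forms (standard scikit-learn/NumPy names, not part of the quoted docs):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

clf_seed = DecisionTreeClassifier(random_state=0)                          # int: fixed seed
clf_rng = DecisionTreeClassifier(random_state=np.random.RandomState(0))    # RandomState instance
clf_auto = DecisionTreeClassifier(random_state=None)                       # default: np.random's global state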

So, the random algorithm will be used in any case. Passing any value (whether a specific int, e.g. 0, or a RandomState instance) will not change that. The only rationale for passing an int value (0 or otherwise) is to make the outcome consistent across calls: if you call this with random_state=0 (or any other value), then each and every time you'll get the same result.
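
A quick sketch of that reproducibility, assuming export_text (available since scikit-learn 0.21) to render the fitted trees for comparison:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Two independent fits with the same seed produce the identical tree.
first = export_text(DecisionTreeClassifier(random_state=0).fit(X, y))
second = export_text(DecisionTreeClassifier(random_state=0).fit(X, y))
print(first == second)  # True on every run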
