Difference between min_samples_split and min_samples_leaf in sklearn DecisionTreeClassifier
Question
I was going through the sklearn class DecisionTreeClassifier.
Looking at the parameters of the class, there are two named min_samples_split and min_samples_leaf. The basic idea behind them looks similar: you specify a minimum number of samples that decides whether a node becomes a leaf or is split further.
Why do we need two parameters when one implies the other? Is there any reason or scenario that distinguishes them?
Recommended answer
From the documentation:
"The main difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split can create arbitrary small leaves, though min_samples_split is more common in the literature."
To get a grasp of this piece of documentation, I think you should make the distinction between a leaf (also called an external node) and an internal node. An internal node has further splits (its children), while a leaf is by definition a node without any children (without any further splits).
min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node.
For instance, if min_samples_split = 5 and there are 7 samples at an internal node, then the split is allowed. But suppose the split results in two leaves, one with 1 sample and another with 6 samples. If min_samples_leaf = 2, then that split won't be allowed (even though the internal node has 7 samples), because one of the resulting leaves would have fewer than the minimum number of samples required to be at a leaf node.
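The two checks in this example can be written out as a small function. This is only a sketch of the rule as described here: split_allowed is a hypothetical helper, not part of scikit-learn, and it models a single candidate split rather than the library's actual splitter.

```python
def split_allowed(n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Hypothetical helper: would a candidate split pass both constraints?

    n_left and n_right are the sample counts the split would send to
    each child; their sum is the sample count of the node being split.
    """
    n_node = n_left + n_right
    # min_samples_split looks at the node *before* it is split.
    if n_node < min_samples_split:
        return False
    # min_samples_leaf looks at each child *after* the split.
    if n_left < min_samples_leaf or n_right < min_samples_leaf:
        return False
    return True


# The 7-sample node from the example, split into 1 + 6 samples.
print(split_allowed(1, 6, min_samples_split=5, min_samples_leaf=1))  # True
print(split_allowed(1, 6, min_samples_split=5, min_samples_leaf=2))  # False
# A more balanced 2 + 5 split of the same node passes both checks.
print(split_allowed(2, 5, min_samples_split=5, min_samples_leaf=2))  # True
```

Note that rejecting the 1 + 6 split does not by itself turn the node into a leaf: a different candidate split of the same node, such as 2 + 5, can still satisfy both constraints.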
As the documentation quoted above mentions, min_samples_leaf guarantees a minimum number of samples in every leaf, no matter the value of min_samples_split.
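One way to check this guarantee is to fit two trees and look at how many training samples end up in each leaf. The following is a minimal sketch under assumptions of my own: the synthetic dataset, the value 10, and the leaf_sizes helper are all illustrative and not from the question. It relies on the fitted tree's tree_ attribute, where leaves are marked by a child id of -1.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def leaf_sizes(clf):
    """Illustrative helper: training-sample count of every leaf in a fitted tree."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # leaves have no children (id -1)
    return tree.n_node_samples[is_leaf]


# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Constrain only the size of nodes that may be split.
split_only = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)
# Constrain only the size of the resulting leaves.
leaf_only = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

smallest_split_only = leaf_sizes(split_only).min()
smallest_leaf_only = leaf_sizes(leaf_only).min()

print("smallest leaf with min_samples_split=10:", smallest_split_only)
print("smallest leaf with min_samples_leaf=10: ", smallest_leaf_only)
```

With min_samples_leaf=10 the smallest leaf can never hold fewer than 10 samples. With only min_samples_split=10 there is no such lower bound, so the smallest leaf may be much smaller, depending on the data.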