What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?


Problem description

I've read from this documentation that:

"Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value."

But, it is still unclear to me how this works. If I set sample_weight to an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.

Solution

So I spent a little time looking at the sklearn source because I've actually been meaning to try to figure this out myself for a little while now, too. I apologize for the length, but I don't know how to explain it more briefly.


Some quick preliminaries:

Let's say we have a classification problem with K classes. In a region of feature space represented by a node of the decision tree, recall that the "impurity" of the region is measured by quantifying its inhomogeneity, using the class probabilities in that region. Normally, we estimate:

Pr(Class=k) = #(examples of class k in region) / #(total examples in region)

The impurity measure takes as input the array of class probabilities:

[Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]

and spits out a number, which tells you how "impure" or how inhomogeneous-by-class the region of feature space is. For example, the gini measure for a two class problem is 2*p*(1-p), where p = Pr(Class=1) and 1-p = Pr(Class=2).
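To make the arithmetic below easy to check, here is a minimal sketch of that computation. The helper name gini is just for this illustration; sklearn computes impurity internally through its own criterion code:

# Gini impurity for an array of class probabilities: 1 - sum_k p_k^2.
def gini(probs):
    return 1.0 - sum(p * p for p in probs)

# For two classes this reduces to 2*p*(1-p):
p = 2.0 / 3.0
print(gini([p, 1 - p]))  # 0.4444...
print(2 * p * (1 - p))   # same value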


Now, basically the short answer to your question is:

sample_weight augments the probability estimates in the probability array ... which augments the impurity measure ... which augments how nodes are split ... which augments how the tree is built ... which augments how feature space is diced up for classification.

I believe this is best illustrated through example.


First consider the following 2-class problem where the inputs are 1 dimensional:

from sklearn.tree import DecisionTreeClassifier as DTC

X = [[0],[1],[2]] # 3 simple training examples
Y = [ 1,  2,  1 ] # class labels

dtc = DTC(max_depth=1)

So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the gini measure.


Case 1: no sample_weight

dtc.fit(X, Y)
print(dtc.tree_.threshold)
# [0.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0, 0.5]

The first value in the threshold array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in threshold are placeholders and are to be ignored. The impurity array tells us the computed impurity values in the parent, left, and right nodes respectively.
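If it helps to see which node is which, the fitted tree also exposes child-pointer arrays alongside threshold; these are standard attributes of sklearn's tree_ object, shown here against the dtc fitted above:

print(dtc.tree_.children_left)   # [ 1 -1 -1]: node 0's left child is node 1
print(dtc.tree_.children_right)  # [ 2 -1 -1]: node 0's right child is node 2; -1 marks a leaf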

In the parent node, p = Pr(Class=1) = 2/3, so that gini = 2*(2/3)*(1/3) = 0.444.... You can confirm the child node impurities as well.
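You can reproduce all three values with the gini helper sketched earlier:

# Parent region holds classes [1, 2, 1] -> p = Pr(Class=1) = 2/3
print(gini([2/3, 1/3]))  # 0.4444...
# Left child (x <= 0.5): a single class-1 example -> pure
print(gini([1.0, 0.0]))  # 0.0
# Right child (x > 0.5): classes [2, 1] -> p = 1/2
print(gini([0.5, 0.5]))  # 0.5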


Case 2: with sample_weight

Now, let's try:

dtc.fit(X, Y, sample_weight=[1, 2, 3])
print(dtc.tree_.threshold)
# [1.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0.44444444, 0.]

You can see the feature threshold is different. sample_weight also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.

The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:

p = Pr(Class=1) = (1+3) / (1+2+3) = 2.0/3.0

The gini measure of 4/9 follows.
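In code, using the same illustrative helper, the weighted counts give:

w = [1, 2, 3]               # the sample weights we passed to fit
p = (w[0] + w[2]) / sum(w)  # weighted Pr(Class=1) = 4/6 = 2/3
print(gini([p, 1 - p]))     # 0.4444... = 4/9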

Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that impurity is calculated to be 4/9 also in the left child node because:

p = Pr(Class=1) = 1 / (1+2) = 1/3.

The impurity of zero in the right child is due to only one training example lying in that region.
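The same weighted-count arithmetic covers both children, again with the illustrative helper:

# Left child (x <= 1.5): class-1 weight 1 vs class-2 weight 2 -> p = 1/3
print(gini([1/3, 2/3]))  # 0.4444... = 4/9
# Right child (x > 1.5): only the third example (class 1, weight 3) -> pure
print(gini([1.0, 0.0]))  # 0.0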

You can extend this with non-integer sample weights similarly. I recommend trying something like sample_weight = [1,2,2.5] and confirming the computed impurities, as sketched below.
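Here is a sketch of that experiment; the commented values are what the weighted-count arithmetic above predicts, so treat them as expectations to verify rather than quoted output:

dtc = DTC(max_depth=1)
dtc.fit(X, Y, sample_weight=[1, 2, 2.5])
print(dtc.tree_.threshold)  # expect a split at 1.5 again
print(dtc.tree_.impurity)
# Parent: p = (1 + 2.5) / 5.5 = 7/11 -> gini = 2*(7/11)*(4/11) ≈ 0.4628
# Left child: p = 1/(1+2) = 1/3 -> 4/9; right child: pure -> 0.0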

Hope this helps!
