What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

Question

I've read from the relevant documentation that:

Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.

But, it is still unclear to me how this works. If I set sample_weight with an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.

Answer

Some quick preliminaries:

Let's say we have a classification problem with K classes. In a region of feature space represented by a node of the decision tree, recall that the "impurity" of the region is measured by quantifying its inhomogeneity, using the class probabilities in that region. Normally, we estimate:

Pr(Class=k) = #(examples of class k in region) / #(total examples in region)

The impurity measure takes as input the array of class probabilities:

[Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]

and spits out a number, which tells you how "impure" or inhomogeneous-by-class the region of feature space is. For example, the Gini measure for a two-class problem is 2*p*(1-p), where p = Pr(Class=1) and 1-p = Pr(Class=2).
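As a concrete sketch in Python (the helper name gini is just for this illustration, not an sklearn function):

def gini(p):
    # Gini impurity for a two-class problem, where p = Pr(Class=1)
    return 2 * p * (1 - p)

print(gini(0.5))  # 0.5 -- a maximally mixed region
print(gini(1.0))  # 0.0 -- a pure region, all class 1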

Now, basically the short answer to your question is:

sample_weight augments the probability estimates in the probability array ... which augments the impurity measure ... which augments how nodes are split ... which augments how the tree is built ... which augments how feature space is diced up for classification.

I believe this is best illustrated through example.

First consider the following 2-class problem where the inputs are 1 dimensional:

from sklearn.tree import DecisionTreeClassifier as DTC

X = [[0],[1],[2]] # 3 simple training examples
Y = [ 1,  2,  1 ] # class labels

dtc = DTC(max_depth=1)

So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the Gini measure.

dtc.fit(X, Y)
print(dtc.tree_.threshold)
# [0.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0, 0.5]

The first value in the threshold array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in threshold are placeholders and are to be ignored. The impurity array tells us the computed impurity values in the parent, left, and right nodes respectively.
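If you want to see the node layout directly, the fitted tree exposes a few more arrays (node 0 is the root; -1 marks a leaf):

print(dtc.tree_.children_left)   # [ 1 -1 -1]
print(dtc.tree_.children_right)  # [ 2 -1 -1]
print(dtc.tree_.n_node_samples)  # [3 1 2]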

In the parent node, p = Pr(Class=1) = 2. / 3., so that gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444..... You can confirm the child node impurities as well.
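As a quick hand-check (plain arithmetic, with p = Pr(Class=1) in each node):

print(2 * (2/3) * (1/3))    # parent: 0.4444...
print(2 * 1.0 * (1 - 1.0))  # left child, only class 1: 0.0
print(2 * 0.5 * (1 - 0.5))  # right child, one of each class: 0.5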

Now, let's try:

dtc.fit(X, Y, sample_weight=[1, 2, 3])
print(dtc.tree_.threshold)
# [1.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0.44444444, 0.]

You can see the feature threshold is different. sample_weight also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.

The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:

p = Pr(Class=1) = (1+3) / (1+2+3) = 2.0/3.0

The corresponding Gini measure of 4/9 follows.
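As a sketch, the same weighted count in code (w and Y as in the example above):

import numpy as np

w = np.array([1, 2, 3])        # the sample weights passed to fit
Y = np.array([1, 2, 1])        # class labels
p = w[Y == 1].sum() / w.sum()  # (1 + 3) / (1 + 2 + 3) = 2/3
print(2 * p * (1 - p))         # 0.4444..., matching tree_.impurity[0]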

Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that impurity is calculated to be 4/9 also in the left child node because:

p = Pr(Class=1) = 1 / (1+2) = 1/3.

The impurity of zero in the right child is due to only one training example lying in that region.
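The same weighted arithmetic confirms both children (a hand-check sketch):

p_left = 1 / (1 + 2)                # left child: weight 1 (class 1) and weight 2 (class 2)
print(2 * p_left * (1 - p_left))    # 0.4444..., matching tree_.impurity[1]
p_right = 3 / 3                     # right child: only example 3 (class 1), weight 3
print(2 * p_right * (1 - p_right))  # 0.0, matching tree_.impurity[2]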

You can extend this with non-integer sample weights similarly. I recommend trying something like sample_weight = [1,2,2.5], and confirming the computed impurities.
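For instance, a sketch of that exercise (the expected values in the comments follow from the same weighted-count arithmetic, so verify them against the actual output):

dtc.fit(X, Y, sample_weight=[1, 2, 2.5])
print(dtc.tree_.threshold)
print(dtc.tree_.impurity)
# By hand: parent p = (1 + 2.5) / 5.5 = 7/11, so Gini = 2*(7/11)*(4/11) ~ 0.463.
# A split at 1.5 again gives a left child with p = 1/3 (Gini 4/9) and a pure
# right child (Gini 0), which should be the lower weighted-average impurity.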
