class_weight hyperparameter in Random Forest changes the number of samples in the confusion matrix


Problem description


I'm currently working on a Random Forest Classification model which contains 24,000 samples where 20,000 of them belong to class 0 and 4,000 of them belong to class 1. I made a train_test_split where test_set is 0.2 of the whole dataset (around 4,800 samples in test_set). Since I'm dealing with imbalanced data, I looked at the hyperparameter class_weight which is aimed to solve this issue.


The moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:

array([[13209,   747],
       [ 2776,  2468]])


As you can see, the lower row corresponds to False Negative = 2776 followed by True Positive = 2468, while the upper row corresponds to True Negative = 13209 followed by False Positive = 747. The problem is that the number of samples belonging to class 1 according to the confusion_matrix is 2,776 (False Negative) + 2,468 (True Positive), which sums up to 5,244 samples belonging to class 1. This doesn't make any sense, since the whole dataset contains only 4,000 samples belonging to class 1, of which only 3,200 are in the train_set. It looks like the confusion_matrix returns a transposed version of the matrix, because the actual number of samples belonging to class 1 should sum up to 3,200 in the train_set and 800 in the test_set. In general, the right numbers should be 747 + 2468, which sums up to 3,215, which is the right number of samples belonging to class 1.

Can someone explain to me what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way? I have tried looking for an answer and visited several questions which are somewhat similar, but none of them really covered this issue.
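The setup above can be reproduced on synthetic data (a minimal sketch, assuming scikit-learn; the feature values are random and only the class counts mirror the question). The key point is that confusion_matrix puts true labels on the rows, so the row sums recover the true class counts regardless of class_weight:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset in the question:
# 24,000 samples, 20,000 of class 0 and 4,000 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(24_000, 5))
y = np.array([0] * 20_000 + [1] * 4_000)

# Stratified 80/20 split, as in the question: 16,000 / 3,200
# samples per class end up in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=20,
                             class_weight='balanced',
                             random_state=0)
clf.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels, so the row
# sums equal the true per-class counts in y_train.
cm = confusion_matrix(y_train, clf.predict(X_train))
print(cm.sum(axis=1))  # row sums -> 16000 and 3200
```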


Those are some of the sources I looked at:

scikit-learn: Random Forest class_weight and sample_weight parameters

How to tune parameters in Random Forest using Scikit Learn?

https://datascience.stackexchange.com/questions/11564/how-does-class-weights-work-in-randomforestclassifier

https://stats.stackexchange.com/questions/244630/sample-weight-and-class-weight-randomforest-classifier

Unbalanced dataset in Random Forest classifier

Thanks for any help.

Recommended answer


So, the reading of the confusion matrix you have provided seems to be correct.


Is it true that the confusion_matrix returns a transposed version of the matrix?


No. But a very easy (and innocent-looking) mistake is to interchange the order of the y_true and y_pred arguments, which does matter; the result would indeed be a transposed matrix:

from sklearn.metrics import confusion_matrix

# toy labels, chosen here to produce the outputs below
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 1]

# correct order of arguments:
confusion_matrix(y_true, y_pred)
# array([[0, 2],
#        [1, 1]])

# inverted (wrong) order of the arguments:
confusion_matrix(y_pred, y_true)
# array([[0, 1],
#        [2, 1]])


It is impossible to say from the info you have provided whether this is the reason, which is a good reminder of why you should always post your actual code, rather than a verbal description of what you think your code is doing...
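As a side note, class_weight='balanced' only reweights samples during training; it has no effect on how confusion_matrix tabulates its rows and columns. The weights scikit-learn applies can be inspected directly (a small sketch, using the stratified training counts from the question); each class gets weight n_samples / (n_classes * count_of_that_class):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training labels matching the question's stratified split:
# 16,000 samples of class 0 and 3,200 of class 1.
y_train = np.array([0] * 16_000 + [1] * 3_200)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight('balanced',
                               classes=np.array([0, 1]),
                               y=y_train)
print(weights)  # 19200/(2*16000) = 0.6 and 19200/(2*3200) = 3.0
```

So the minority class is upweighted 5x relative to the majority class, but the counts in the confusion matrix stay exactly the true sample counts.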
