class_weight hyperparameter in Random Forest changes the number of samples in the confusion matrix
Question
I'm currently working on a Random Forest classification model which contains 24,000 samples, where 20,000 of them belong to class 0 and 4,000 of them belong to class 1. I made a train_test_split where the test_set is 0.2 of the whole dataset (around 4,800 samples in the test_set). Since I'm dealing with imbalanced data, I looked at the hyperparameter class_weight, which is aimed at solving this issue.
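Since the question gives the class counts, the effect of class_weight='balanced' is easy to compute by hand: scikit-learn documents the balanced weight as n_samples / (n_classes * count_per_class). A minimal sketch of that formula, plugging in the question's 20,000 / 4,000 split (the counts come from the question; the formula is from the scikit-learn docs):

```python
import numpy as np

# Class counts from the question: 20,000 in class 0, 4,000 in class 1
counts = np.array([20_000, 4_000])
n_samples = counts.sum()   # 24,000
n_classes = len(counts)    # 2

# 'balanced' weight formula used by scikit-learn:
#   w_c = n_samples / (n_classes * count_c)
weights = n_samples / (n_classes * counts)
print(weights)  # [0.6 3. ]
```

With these weights, each class-1 sample counts five times as much as a class-0 sample during training. That changes which splits the trees prefer, but it does not change the number of samples counted in the confusion matrix.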
The problem I'm facing is that the moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:
array([[13209,   747],
       [ 2776,  2468]])
As you can see, the lower row corresponds to False Negative = 2776 followed by True Positive = 2468, while the upper row corresponds to True Negative = 13209 followed by False Positive = 747. The problem is that, according to the confusion_matrix, the number of samples belonging to class 1 is 2,776 (False Negative) + 2,468 (True Positive), which sums to 5,244 samples. This doesn't make any sense, since the whole dataset contains only 4,000 samples belonging to class 1, of which only about 3,200 are in the train_set. It looks like the confusion_matrix returns a transposed version of the matrix, because the actual number of class-1 samples should sum to about 3,200 in the train_set and 800 in the test_set. In general, the right numbers should be 747 + 2468, which sums to 3,215, the right number of samples belonging to class 1 in the training set.
Can someone explain to me what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way?
I have tried looking for an answer and visited several questions that are somewhat similar, but none of them really covered this issue.
These are some of the sources I looked at:
scikit-learn: Random Forest class_weight and sample_weight parameters
Thanks for any help.
Answer
So, the reading of the confusion matrix you have provided seems to be correct.
Is it true that the confusion_matrix returns a transposed version of the matrix?
No: scikit-learn's confusion_matrix puts the true labels on the rows and the predictions on the columns. But a very easy (and innocent-looking) mistake is to interchange the order of the y_true and y_pred arguments, which does matter; the result would indeed be a transposed matrix:
from sklearn.metrics import confusion_matrix

# toy labels chosen to reproduce the outputs below (the original
# answer's data was not shown, so these are illustrative values)
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 1]

# correct order of arguments:
confusion_matrix(y_true, y_pred)
# array([[0, 2],
#        [1, 1]])

# inverted (wrong) order of the arguments:
confusion_matrix(y_pred, y_true)
# array([[0, 1],
#        [2, 1]])
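A quick sanity check for the orientation: with scikit-learn's convention (rows = true labels, columns = predicted labels), row i must sum to the number of samples whose true label is i. A minimal sketch with a toy reimplementation of that convention (the labels here are made up for illustration):

```python
import numpy as np

def tiny_confusion(true, pred, n_classes=2):
    # rows = true labels, columns = predicted labels
    # (the convention scikit-learn's confusion_matrix follows)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true, pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1]   # made-up labels for illustration
y_pred = [1, 1, 0, 1]

cm = tiny_confusion(y_true, y_pred)
# Each row sum equals the number of samples truly in that class,
# so a row sum larger than a class's size means the arguments were swapped.
print(cm.sum(axis=1))   # [2 2]
```

Applied to the question's matrix: the row sums are 13209 + 747 = 13,956 and 2776 + 2468 = 5,244. Since class 1 has at most about 3,200 training samples, a row summing to 5,244 is only possible if the arguments were swapped, in which case the columns, not the rows, hold the true labels (747 + 2468 = 3,215).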
It is impossible to tell whether this is the reason from the info you have provided, which is a good reminder of why you should always provide your actual code rather than a verbal description of what you think your code is doing...