class_weight hyperparameter in Random Forest changes the number of samples in the confusion matrix


Problem description


I'm currently working on a Random Forest Classification model which contains 24,000 samples where 20,000 of them belong to class 0 and 4,000 of them belong to class 1. I made a train_test_split where test_set is 0.2 of the whole dataset (around 4,800 samples in test_set). Since I'm dealing with imbalanced data, I looked at the hyperparameter class_weight which is aimed to solve this issue.


The moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:

array([[13209,   747],
       [ 2776,  2468]])


As you can see, the lower row corresponds to False Negative = 2776 followed by True Positive = 2468, while the upper row corresponds to True Negative = 13209 followed by False Positive = 747. The problem is that the number of samples belonging to class 1 according to the confusion_matrix is 2,776 (False Negative) + 2,468 (True Positive), which sums up to 5,244 samples belonging to class 1. This doesn't make any sense, since the whole dataset contains only 4,000 samples belonging to class 1, of which only 3,200 are in the train_set. It looks like the confusion_matrix returns a transposed version of the matrix, because the actual number of samples belonging to class 1 should sum up to 3,200 in the train_set and 800 in the test_set. In general, the right numbers should be 747 + 2468, which sums up to 3,215, which is the right number of samples belonging to class 1.

Can someone explain to me what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way? I have tried looking for an answer and visited several questions which are somewhat similar, but none of them really covered this issue.
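The setup above can be reproduced on synthetic data (a minimal sketch, assuming scikit-learn; the feature values are random and only the class counts mirror the question). The key point is that confusion_matrix puts true labels on the rows, so the row sums recover the true class counts regardless of class_weight:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset in the question:
# 24,000 samples, 20,000 of class 0 and 4,000 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(24_000, 5))
y = np.array([0] * 20_000 + [1] * 4_000)

# Stratified 80/20 split, as in the question: 16,000 / 3,200
# samples per class end up in the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=20,
                             class_weight='balanced',
                             random_state=0)
clf.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels, so the row
# sums equal the true per-class counts in y_train.
cm = confusion_matrix(y_train, clf.predict(X_train))
print(cm.sum(axis=1))  # row sums -> 16000 and 3200
```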


Those are some of the sources I looked at:

scikit-learn: Random Forest class_weight and sample_weight parameters

How to tune parameters in Random Forest using Scikit Learn?

https://datascience.stackexchange.com/questions/11564/how-does-class-weights-work-in-randomforestclassifier

https://stats.stackexchange.com/questions/244630/sample-weight-and-class-weight-randomforest-classifier

Unbalanced dataset in Random Forest classifier

Thanks for any help.

Recommended answer


So, the reading of the confusion matrix you have provided seems to be correct.


Is it true that the confusion_matrix returns a transposed version of the matrix?


No. But a very easy (and innocent-looking) mistake is to interchange the order of the y_true and y_pred arguments, which does matter; the result would indeed be a transposed matrix:

from sklearn.metrics import confusion_matrix

# toy labels, chosen here to produce the outputs below
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 1]

# correct order of arguments:
confusion_matrix(y_true, y_pred)
# array([[0, 2],
#        [1, 1]])

# inverted (wrong) order of the arguments:
confusion_matrix(y_pred, y_true)
# array([[0, 1],
#        [2, 1]])


It is impossible to say from the info you have provided whether this is the reason, which is a good reminder of why you should always post your actual code, rather than a verbal description of what you think your code is doing...
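As a side note, class_weight='balanced' only reweights samples during training; it has no effect on how confusion_matrix tabulates its rows and columns. The weights scikit-learn applies can be inspected directly (a small sketch, using the stratified training counts from the question); each class gets weight n_samples / (n_classes * count_of_that_class):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training labels matching the question's stratified split:
# 16,000 samples of class 0 and 3,200 of class 1.
y_train = np.array([0] * 16_000 + [1] * 3_200)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight('balanced',
                               classes=np.array([0, 1]),
                               y=y_train)
print(weights)  # 19200/(2*16000) = 0.6 and 19200/(2*3200) = 3.0
```

So the minority class is upweighted 5x relative to the majority class, but the counts in the confusion matrix stay exactly the true sample counts.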
