Error in Confusion Matrix with Random Forest


Problem description

I have a dataset with 4669 observations and 15 variables.

I am using a random forest to predict whether a particular product will be accepted.

With my latest data, my output variable takes the values "Yes", "NO" and "".

I want to predict whether each "" should be Yes or NO.

I am using the following code.

library(randomForest)

outputvar <- c("Yes", "NO", "Yes", "NO", "" , "" )
inputvar1 <- c("M", "M", "F", "F", "M", "F")
inputvar2 <- c("34", "35", "45", "60", "34", "23")
data <- data.frame(cbind(outputvar, inputvar1, inputvar2))
data$outputvar <- factor(data$outputvar, exclude = "")
ind0 <- sample(2, nrow(data), replace = TRUE, prob = c(0.7,0.3))
train0 <- data[ind0==1, ]
test0 <-  data[ind0==2, ]

fit1 <- randomForest(outputvar~., data=train0, na.action = na.exclude)
print(fit1)
plot(fit1)
p1 <- predict(fit1, train0)
fit1$confusion

p2 <- predict(fit1, test0)

t <- table(prediction = p2, actual = test0$outputvar)
t

The above code runs perfectly. The data frame shown here is only a sample, since I am not allowed to share the original data.

As you can see, I have divided my data into training and test sets at 70% and 30%. In my run, the test set has 1377 observations and the training set has 3293 observations.

When I calculate the confusion matrix for the test data set, I find that it has been calculated for only 1363 observations, and 14 observations are left out.

I also looked at the table of predictions for the test data set. All those NA are replaced with Yes or NO.

My question is: why does my confusion matrix have a different number of observations?

Are the Yes and NO values that replaced those NA in my predictions real predictions?

I am new to R, and any information would be helpful.

Solution

You seem a little confused regarding several elementary issues here...

To start with, training data with the dependent variable missing (here outputvar) make no sense; if we don't have the actual outcome for a sample, we cannot use it for training, and we should simply remove it from the training set (save for some rather extreme approaches, where one tries to impute such samples before feeding them to the classifier).

Second, although you seem to imply (kind of...) that your 2 samples with missing outputvar here are the unknown samples you are trying to predict, in practice (i.e., in your code) you are not using them as such. Since the sample function you use to split your data into training & test subsets is random, it can easily be the case that at least one (or even both) of these 2 samples ends up in your training set, where of course it will be of no use.

Third, even if in some runs you end up indeed with these 2 samples in your test set, you cannot of course calculate any confusion matrix, since you do need the ground truth (real labels) for doing so.

All in all, data samples without the true label, like your 2 last ones here, are useful neither for training nor for evaluation of any kind, such as the confusion matrix. They cannot be used either in the training set or in the test set.
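The points above can be sketched as follows. This is only an illustration under stated assumptions, not the asker's actual pipeline: the toy data frame, the seed, and the names toy, labeled, unlabeled, in_train, p_test and p_new are all hypothetical. The idea is to set the unlabeled rows aside first, split only the labeled rows into training and test sets, and use the fitted model on the unlabeled rows purely for prediction.

```r
library(randomForest)

set.seed(42)  # assumed seed, used only to make this sketch reproducible

# Hypothetical toy data: 200 rows, the last 20 of which have no label
n <- 200
toy <- data.frame(
  inputvar1 = factor(sample(c("M", "F"), n, replace = TRUE)),
  inputvar2 = sample(20:65, n, replace = TRUE)
)
toy$outputvar <- factor(ifelse(toy$inputvar2 + rnorm(n, sd = 10) > 40, "Yes", "NO"))
toy$outputvar[(n - 19):n] <- NA

# Separate rows with a known outcome from rows we only want to predict
labeled   <- toy[!is.na(toy$outputvar), ]
unlabeled <- toy[is.na(toy$outputvar), ]

# Split only the labeled rows, roughly 70/30
in_train <- sample(c(TRUE, FALSE), nrow(labeled), replace = TRUE, prob = c(0.7, 0.3))
train0 <- labeled[in_train, ]
test0  <- labeled[!in_train, ]

fit1 <- randomForest(outputvar ~ ., data = train0)

# Evaluation uses labeled test rows only, so every test row enters the table
p_test <- predict(fit1, test0)
cm <- table(prediction = p_test, actual = test0$outputvar)
cm

# Predictions for the unlabeled rows; these cannot be evaluated
p_new <- predict(fit1, unlabeled[, c("inputvar1", "inputvar2")])
table(p_new)
```

With this arrangement the confusion matrix covers every row of the test set, and the unlabeled rows never take part in training or evaluation.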

"The above code runs perfectly"

Not always; due to the random nature of the sample function, you may easily end up with train/test splits that make the classifier impossible to run:

> source('~/.active-rstudio-document')  # your code verbatim
Error in randomForest.default(m, y, ...) : 
  Need at least two classes to do classification.
> train0
  outputvar inputvar1 inputvar2
1       Yes         M        34
5      <NA>         M        34

Try to re-run the code yourself several times to see (since no random seed is set, each run will in principle be different - even the length of your training & test sets will not be the same between runs!).
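A small base R sketch of the seed issue follows. The seed value 123 and the names ind_a, ind_b and train_idx are arbitrary choices for illustration; n is the number of observations mentioned in the question. Calling set.seed before sample makes the split repeatable, and sampling a fixed number of row indices (an alternative to the asker's approach, not what the original code does) keeps the set sizes constant as well.

```r
n <- 4669  # number of observations mentioned in the question

# Without a seed, repeated calls will generally give different group sizes
table(sample(2, n, replace = TRUE, prob = c(0.7, 0.3)))
table(sample(2, n, replace = TRUE, prob = c(0.7, 0.3)))

# With the same seed, the split is identical on every run
set.seed(123)
ind_a <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
set.seed(123)
ind_b <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
identical(ind_a, ind_b)  # TRUE

# Alternative: draw a fixed number of row indices for the training set,
# so the training set always has round(0.7 * n) rows
set.seed(123)
train_idx <- sample(n, size = round(0.7 * n))
length(train_idx)
```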

"When I am calculating my Confusion matrix for test data set, I could find that it has calculated only for 1363 observations and 14 observations are left."

Given what you have shown as a sample, a good guess here is that you do not have the true labels for these 14 observations. And since the confusion matrix comes from a comparison of the predictions versus the actual labels, when the latter are missing the comparison is impossible, and these samples are naturally omitted from the confusion matrix.
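The effect can be reproduced in base R without fitting any model. In this sketch the vectors prediction and actual are made-up values, not output from the asker's data. By default, table drops pairs where either value is NA, while useNA = "ifany" keeps them in a separate column so that the counts add up again.

```r
# Made-up predictions and true labels; the last two labels are missing
prediction <- factor(c("Yes", "NO", "Yes", "NO", "Yes"), levels = c("NO", "Yes"))
actual     <- factor(c("Yes", "NO", "NO", NA, NA),       levels = c("NO", "Yes"))

# Default behaviour: rows with a missing label are silently dropped
cm <- table(prediction = prediction, actual = actual)
cm
sum(cm)             # 3, not 5

# Show the rows with missing labels explicitly
cm_all <- table(prediction = prediction, actual = actual, useNA = "ifany")
cm_all
sum(cm_all)         # 5

# Number of test rows without a true label
sum(is.na(actual))  # 2
```

Applied to the asker's test set, sum(is.na(test0$outputvar)) would be a quick way to check whether the 14 missing observations are indeed rows without a true label.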

"Also, I visualised the table for the predicted matrix with test data set. All those NA are replaced with Yes or NO."

It is not quite clear what exactly you mean here; but if you mean that you run predict on your test set and you did not get any NAs in the predictions, this is exactly as expected. As I explained above, the "missing entries" from your confusion matrix are not due to missing predictions, but due to missing true labels.
