How to perform random forest/cross validation in R


Question

I can't find a way to perform cross-validation on a regression random forest model that I'm trying to build.

I have a dataset with 1664 explanatory variables (different chemical properties) and one response variable (retention time). I'm trying to build a regression random forest model that predicts retention time from these chemical properties.

ID    RT (seconds)  1_MW    2_AMW  3_Sv   4_Se
4281  38            145.29  5.01   14.76  28.37
4952  40            132.19  6.29   11     21.28
4823  41            176.21  7.34   12.9   24.92
3840  41            174.24  6.7    13.99  26.48
3665  42            240.34  9.24   15.2   27.08
3591  42            161.23  6.2    13.71  26.27
3659  42            146.22  6.09   12.6   24.16

This is an example of the table I have. I basically want to plot RT against 1_MW and the other variables (up to 1664 of them), so I can find out which of these variables are important and which aren't.

I did:

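library(randomForest)
# Fit a regression forest; importance = TRUE stores the variable-importance measures
r <- randomForest(RT..seconds. ~ ., data = cadets, importance = TRUE, do.trace = 100)
varImpPlot(r)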

This tells me which variables are important and which are not, which is great. However, I want to be able to partition my dataset so that I can perform cross-validation on it. I found an online tutorial that explained how to do it, but for a classification model rather than a regression model.

I understand that you do this:

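k <- 10                        # number of folds
n <- floor(nrow(cadets) / k)   # rows per fold
i <- 1                         # index of the current fold
s1 <- (i - 1) * n + 1          # first row of fold i
s2 <- i * n                    # last row of fold i
subset <- s1:s2                # row indices held out for fold i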

This defines how many folds you want, the size of each fold, and the start and end rows of the current subset. However, I don't know what to do from here. I was told to loop through the folds, but I honestly have no idea how to do this. Nor do I know how to plot the validation set and the test set on the same graph to show the level of accuracy/error.

Please help if you can.

Recommended answer

From the source:

The out-of-bag (oob) error estimate

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run...
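
For a regression forest fitted with the randomForest package, this internal estimate is stored on the fitted object. Here is a minimal sketch, assuming a small synthetic data frame (`toy`, made up here to stand in for `cadets`) and an arbitrary choice of 300 trees:

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the real data (an assumption, not the poster's dataset):
# 200 rows, 5 predictors named X1..X5, and a numeric response RT
toy <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
toy$RT <- 40 + 3 * toy$X1 - 2 * toy$X2 + rnorm(200)

fit <- randomForest(RT ~ ., data = toy, ntree = 300)

# For regression, fit$mse[i] is the out-of-bag MSE using the first i trees
oob_mse  <- fit$mse[fit$ntree]
oob_rmse <- sqrt(oob_mse)
# fit$rsq is the matching pseudo R-squared: 1 - mse / Var(y)
oob_rsq  <- fit$rsq[fit$ntree]

print(c(oob_rmse = oob_rmse, oob_rsq = oob_rsq))
```

Calling `plot(fit)` on such an object draws the out-of-bag MSE against the number of trees, which can help in judging whether enough trees were grown.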

In particular, predict.randomForest returns the out-of-bag prediction if newdata is not given.
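
The sketch below illustrates this and, for comparison, completes the k-fold loop from the question for a regression forest. The data frame `toy`, the random fold assignment via `sample`, the `rmse` helper and the choice of 300 trees are all assumptions made for illustration; none of them come from the original post.

```r
library(randomForest)

set.seed(42)
# Synthetic stand-in for the real data (hypothetical; replace with cadets)
toy <- data.frame(matrix(rnorm(200 * 5), ncol = 5))
toy$RT <- 40 + 3 * toy$X1 - 2 * toy$X2 + rnorm(200)

# Hypothetical helper: root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# 1) Out-of-bag predictions: predict() called without newdata
fit <- randomForest(RT ~ ., data = toy, ntree = 300)
oob_pred <- predict(fit)
oob_rmse <- rmse(toy$RT, oob_pred)

# 2) Manual 10-fold cross-validation
k <- 10
fold <- sample(rep(seq_len(k), length.out = nrow(toy)))  # random fold label per row
cv_pred <- numeric(nrow(toy))
for (i in seq_len(k)) {
  held_out <- fold == i
  # Train on the other k - 1 folds, then predict the held-out fold
  fit_i <- randomForest(RT ~ ., data = toy[!held_out, ], ntree = 300)
  cv_pred[held_out] <- predict(fit_i, newdata = toy[held_out, ])
}
cv_rmse <- rmse(toy$RT, cv_pred)

print(c(oob_rmse = oob_rmse, cv_rmse = cv_rmse))

# Observed vs. predicted for both estimates on one graph
plot(toy$RT, oob_pred, xlab = "Observed RT", ylab = "Predicted RT")
points(toy$RT, cv_pred, col = "red", pch = 3)
abline(0, 1, lty = 2)  # perfect-prediction reference line
```

With real data, the two error estimates would be expected to land close to each other, which is the practical point of the quoted passage. Assigning folds at random rather than as consecutive row blocks (`s1:s2`) avoids problems when the rows are sorted, as they appear to be in the example table.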
