Logistic regression: how to try every combination of predictors in R?
Question
I want to perform a logistic regression: I have 1 dependent variable and ~10 predictors.
I want to perform an exhaustive search trying every combination, such as changing order and adding/deleting predictors, etc. For example:
y ~ x1 + x2 + x3 + x4 + x5
y ~ x2 + x1 + x3 + x4 + x5
y ~ x1 + x2 + x3
y ~ x5 + x1 + x2 + x3 + x4
y ~ x4 + x2
... and so on.
Computational time is not a stopping issue for me in this case: this is mainly an educational exercise.
Do you know how I can do this? I use R.
To be clear: this is mainly an educational exercise: I want to test every model so I can sort them all according to some indexes (such as AUC or pseudo-R²) in order to show to my "students" which predictors seem interesting but are not scientifically meaningful. I plan to perform bootstrap resampling to test further the "fishiest" models.
Recommended answer
I am not sure about the value of this "educational exercise", but for the sake of programming, here would be my approach:
First, let's create some example predictor names. I use 5 predictors as in your example, but for 10, you would obviously need to replace 5 with 10.
X = paste0("x",1:5)
X
[1] "x1" "x2" "x3" "x4" "x5"
Now, we can get the combinations with combn.
For instance, for one variable at a time:
t(combn(X,1))
[,1]
[1,] "x1"
[2,] "x2"
[3,] "x3"
[4,] "x4"
[5,] "x5"
For two variables at a time:
t(combn(X,2))
[,1] [,2]
[1,] "x1" "x2"
[2,] "x1" "x3"
[3,] "x1" "x4"
[4,] "x1" "x5"
[5,] "x2" "x3"
[6,] "x2" "x4"
[7,] "x2" "x5"
[8,] "x3" "x4"
[9,] "x3" "x5"
[10,] "x4" "x5"
etc.
We can use lapply to call these functions successively with an increasing number of variables to consider, and to capture the results in a list. For instance, have a look at the output of lapply(1:5, function(n) t(combn(X,n))). To turn these combinations into formulas, we can use the following:
out <- unlist(lapply(1:5, function(n) {
  # get combinations
  combinations <- t(combn(X,n))
  # collapse them into usable formulas:
  formulas <- apply(combinations, 1,
                    function(row) paste0("y ~ ", paste0(row, collapse = "+")))}))
Or equivalently, using the FUN argument of combn (as pointed out by user20650):
out <- unlist(lapply(1:5, function(n) combn(X, n, FUN=function(row) paste0("y ~ ", paste0(row, collapse = "+")))))
This gives:
out
[1] "y ~ x1" "y ~ x2" "y ~ x3" "y ~ x4" "y ~ x5"
[6] "y ~ x1+x2" "y ~ x1+x3" "y ~ x1+x4" "y ~ x1+x5" "y ~ x2+x3"
[11] "y ~ x2+x4" "y ~ x2+x5" "y ~ x3+x4" "y ~ x3+x5" "y ~ x4+x5"
[16] "y ~ x1+x2+x3" "y ~ x1+x2+x4" "y ~ x1+x2+x5" "y ~ x1+x3+x4" "y ~ x1+x3+x5"
[21] "y ~ x1+x4+x5" "y ~ x2+x3+x4" "y ~ x2+x3+x5" "y ~ x2+x4+x5" "y ~ x3+x4+x5"
[26] "y ~ x1+x2+x3+x4" "y ~ x1+x2+x3+x5" "y ~ x1+x2+x4+x5" "y ~ x1+x3+x4+x5" "y ~ x2+x3+x4+x5"
[31] "y ~ x1+x2+x3+x4+x5"
This can now be passed to your logistic regression function.
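Since the worked example in this answer fits linear models, here is a minimal sketch of the same idea with an actual logistic regression. It assumes glm with family = binomial, uses the binary am column of mtcars as the response, and picks three predictors purely for illustration; none of these choices come from the question. The auc helper is a hypothetical name: it computes the in-sample AUC from ranks (the Mann-Whitney form), so no extra package is needed.

```r
# Illustrative choices (assumptions): binary response `am`, three predictors.
X <- c("mpg", "hp", "wt")

# Build every formula, exactly as above but with `am` on the left-hand side.
out <- unlist(lapply(seq_along(X), function(n)
  combn(X, n, FUN = function(row) paste0("am ~ ", paste0(row, collapse = "+")))))

# Fit one logistic regression per formula.
# On a dataset this small, glm may warn about fitted probabilities of 0 or 1.
mods <- lapply(out, function(frml)
  glm(as.formula(frml), data = mtcars, family = binomial))

# Hypothetical helper: in-sample AUC via the rank (Mann-Whitney) formula.
auc <- function(y, p) {
  r  <- rank(p)                      # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Collect AIC and AUC per model, then sort by AUC (highest first).
res <- data.frame(
  frml = out,
  AIC  = sapply(mods, AIC),
  AUC  = sapply(mods, function(m) auc(mtcars$am, fitted(m))),
  stringsAsFactors = FALSE
)
res <- res[order(-res$AUC), ]
res
```

Note that this AUC is computed on the same data used for fitting, so it is optimistic; the bootstrap resampling mentioned in the question would be one way to check it.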
Example:
Let's use the mtcars dataset, with mpg as the dependent variable.
X = names(mtcars[,-1])
X
[1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
Now, let's use the aforementioned function:
out <- unlist(lapply(1:length(X), function(n) combn(X, n, FUN=function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))
which gives us a vector of all combinations as formulas.
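As a quick sanity check, the number of non-empty subsets of p predictors is 2^p - 1, so the 10 predictors in mtcars should yield 1023 formulas. The order of terms on the right-hand side does not change the fitted model, which is why combinations rather than permutations are enough. The snippet below rebuilds out so it can be run on its own:

```r
# Rebuild the formula vector for the 10 mtcars predictors.
X <- names(mtcars)[-1]
out <- unlist(lapply(seq_along(X), function(n)
  combn(X, n, FUN = function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))

# Count the formulas and compare with the closed-form values.
p <- length(X)
n_models <- length(out)
n_models == sum(choose(p, 1:p))   # sum over subset sizes
n_models == 2^p - 1               # all non-empty subsets
```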
To run the corresponding models, we can do, for instance:
mods = lapply(out, function(frml) lm(frml, data=mtcars))
Since you want to capture specific statistics and order the models accordingly, I would use broom::glance. broom::tidy turns lm output into a dataframe (useful if you want to compare coefficients etc.) and broom::glance turns e.g. r-squared, sigma, the F-statistic, the logLikelihood, AIC, BIC etc. into a dataframe. For instance:
library(broom)
library(dplyr)
tmp = bind_rows(lapply(out, function(frml) {
  a = glance(lm(frml, data=mtcars))
  a$frml = frml
  return(a)
}))
head(tmp)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual frml
1 0.7261800 0.7170527 3.205902 79.561028 6.112687e-10 2 -81.65321 169.3064 173.7036 308.3342 30 mpg ~ cyl
2 0.7183433 0.7089548 3.251454 76.512660 9.380327e-10 2 -82.10469 170.2094 174.6066 317.1587 30 mpg ~ disp
3 0.6024373 0.5891853 3.862962 45.459803 1.787835e-07 2 -87.61931 181.2386 185.6358 447.6743 30 mpg ~ hp
4 0.4639952 0.4461283 4.485409 25.969645 1.776240e-05 2 -92.39996 190.7999 195.1971 603.5667 30 mpg ~ drat
5 0.7528328 0.7445939 3.045882 91.375325 1.293959e-10 2 -80.01471 166.0294 170.4266 278.3219 30 mpg ~ wt
6 0.1752963 0.1478062 5.563738 6.376702 1.708199e-02 2 -99.29406 204.5881 208.9853 928.6553 30 mpg ~ qsec
which you can sort as you wish.
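For example, one possible way to rank the models is by AIC. The sketch below is a base R variant that does not rely on broom: it rebuilds the table with only two statistics (adjusted R-squared and AIC), which is my simplification rather than the glance output shown above, and then sorts with order.

```r
# Rebuild the formula vector for the mtcars example.
X <- names(mtcars)[-1]
out <- unlist(lapply(seq_along(X), function(n)
  combn(X, n, FUN = function(row) paste0("mpg ~ ", paste0(row, collapse = "+")))))

# One row per model with a couple of summary statistics.
tmp <- do.call(rbind, lapply(out, function(frml) {
  fit <- lm(as.formula(frml), data = mtcars)
  data.frame(frml = frml,
             adj.r.squared = summary(fit)$adj.r.squared,
             AIC = AIC(fit),
             stringsAsFactors = FALSE)
}))

# Sort by AIC, lowest (best) first, and inspect the top models.
sorted <- tmp[order(tmp$AIC), ]
head(sorted)
```

With the dplyr-based tmp from above, arrange(tmp, AIC) would do the same job.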