R: predict a glm fit on each column in a data frame using the column index number


Question

I am trying to fit a BLR (binary logistic regression) model to each column in a data frame and then predict on new data points. There are a lot of columns, so I cannot identify them by name, only by column number. I have reviewed several similar examples on this site but cannot figure out why this does not work.

df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y = rbinom(1000, size = 1, prob = 0.40))

for (i in 1:length(df)-1)
{
        fit <- glm (y ~ df[,i], data = df, family = binomial, na.action = na.exclude)

        new_pts <- data.frame(seq(min(df[,i], na.rm = TRUE), max(df[,i], na.rm = TRUE), len = 200))
        names(new_pts) <- names(df[, i])

        new_pred <- predict(fit, newdata = new_pts, type = "response")

}

The predict() function raises a warning message and returns an array 1000 elements long, whereas the test data has only 200 elements.

Warning message: 'newdata' has 200 lines but the variables found have 1000 lines

Recommended answer

For repeated modelling I use an approach like the one shown below. I have implemented it with data.table, but it could be rewritten to use a base data.frame (the code would then probably be more verbose). In this approach I store all the models in a separate object. Below are two versions of the code: a more explanatory one, and a more advanced one that aims at clean output.
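As a rough illustration of the base data.frame rewrite mentioned above, here is a minimal sketch that keeps one fitted model per predictor in a named list. It is not the answer author's code: the names `x_cols`, `models` and `preds` are made up for this example, the data are the same simple data used in the data.table code, and `reformulate()` is one of several ways to build the formula from a column name.

```r
df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y  = seq(1, 20, 2))

# every column except the response (hypothetical helper name)
x_cols <- setdiff(names(df), "y")

# one fitted model per predictor column, stored in a named list
models <- lapply(x_cols, function(xname) {
  glm(reformulate(xname, response = "y"), data = df)
})
names(models) <- x_cols

# 200 evenly spaced new points per predictor, plus their predictions
preds <- lapply(x_cols, function(xname) {
  new_pts <- data.frame(seq(min(df[[xname]], na.rm = TRUE),
                            max(df[[xname]], na.rm = TRUE),
                            length.out = 200))
  names(new_pts) <- xname  # must match the term name used in the formula
  data.frame(x = new_pts[[xname]],
             predicted = predict(models[[xname]], newdata = new_pts,
                                 type = "response"))
})
names(preds) <- x_cols
```

Because the models stay available in `models`, they can be inspected later, for example with `summary(models$x1)`, independently of the predictions in `preds`.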

Of course, you could also write a loop or function that fits only one model per iteration without storing them. From my perspective, it's a good idea to save the models, since you will probably have to investigate them for robustness and so on, and not only predict new values.
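For the one-model-per-iteration variant, the following is a hedged sketch of how the loop from the question could be made to work by column index. It is an illustration, not part of the original answer. It assumes the response is the last column, uses `reformulate()` to build the formula, and `set.seed()` is added only to make the sketch reproducible. The comments point out three things in the original loop that, as far as I can tell, lead to the 1000-element result.

```r
set.seed(1)  # only for reproducibility of this sketch
df <- data.frame(x1 = runif(1000, -10, 10),
                 x2 = runif(1000, -2, 2),
                 x3 = runif(1000, -5, 5),
                 y  = rbinom(1000, size = 1, prob = 0.40))

# 1:length(df)-1 parses as (1:length(df)) - 1, i.e. 0:3, because ":" binds
# tighter than "-"; seq_len(ncol(df) - 1) gives 1, 2, 3
for (i in seq_len(ncol(df) - 1)) {
  # names(df[, i]) is NULL because df[, i] is a plain vector
  xname <- names(df)[i]

  # y ~ df[, i] creates a term literally called "df[, i]", which predict()
  # re-evaluates against the full df instead of newdata; building
  # y ~ x1, y ~ x2, ... lets predict() find the column in newdata
  fit <- glm(reformulate(xname, response = "y"), data = df,
             family = binomial, na.action = na.exclude)

  new_pts <- data.frame(seq(min(df[[i]], na.rm = TRUE),
                            max(df[[i]], na.rm = TRUE),
                            length.out = 200))
  names(new_pts) <- xname

  # one predicted probability per row of new_pts
  new_pred <- predict(fit, newdata = new_pts, type = "response")
}
```

As in the question, `fit` and `new_pred` are overwritten on every pass, so anything that should be kept has to be saved inside the loop.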

HINT: Please also have a look at the answer by @AndS., which provides a tidyverse approach. Together with this answer, I think it makes a nice side-by-side comparison for learning and understanding the data.table and tidyverse approaches.

# simpler data are used here so the plots make it easy to check that the output is correct
df <- data.frame(x1 = seq(1, 100, 10),
                 x2 = (1:10)^2,
                 y =  seq(1, 20, 2))
library(data.table)
setDT(df)
# prepare the data by melting it
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
# a simpler model is also used (in this case lm would do as well)
# create model for each variable (formerly columns)
models = setnames(DT[, data.table(list(glm(y ~ x))), by = "variable"], "V1", "model")
# create a new set of data to be predicted
# NOTE: this could, of course, also be added to the models data.table
# as new column via `:=list(...)`
new_pts = setnames(DT[, seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), len = 200), by = variable], "V1", "x")
# add the predicted values
new_pts[, predicted:= predict(models[variable == unlist(.BY), model][[1]], newdata = as.data.frame(x),  type = "response")
        , by = variable]
# plot and check if it makes sense
plot(df$x1, df$y)
lines(new_pts[variable == "x1", .(x, predicted)])
points(df$x2, df$y)
lines(new_pts[variable == "x2", .(x, predicted)])

# the following version of the code above is also possible;
# it keeps everything in a single object (DT) in the environment,
# but may look more complicated at first sight
# not sure if this is the best way to do it;
# data.table experts might provide some shortcuts
setDT(df)
DT = melt(df, measure.vars = paste0("x", 1:2), value.name = "x")
DT = data.table(variable = unique(DT$variable), dat = split(DT, DT$variable))
DT[, models:= list(list(glm(y ~ x, data = dat[[1]]))), by = variable]
DT[, new_pts:= list(list(data.frame(x = dat[[1]][
                                                 ,seq(min(x, na.rm = TRUE)
                                                 , max(x, na.rm = TRUE), len = 200)]
                                    )))
       , by = variable]
# predict per variable, using the models and new_pts list columns stored in DT
DT[, predicted := list(list(data.frame(pred = predict(models[[1]]
                                          , newdata = new_pts[[1]]
                                          , type = "response")))),
   by = variable]
# plot and check if it makes sense
plot(df$x1, df$y)
lines(DT[variable == "x1", .(unlist(new_pts), unlist(predicted))])
points(df$x2, df$y)
lines(DT[variable == "x2", .(unlist(new_pts), unlist(predicted))])
