R data.table loop subset by factor and do lm()
Question
I am trying to create a function, or even just work out how to run a loop using data.table syntax, where I can subset the table by a factor (in this case the id variable), run a linear model on each subset, and output the results. Sample data below.
library(data.table)

df <- data.frame(id = letters[1:3],
                 cyl = sample(c("a","b","c"), 30, replace = TRUE),
                 factor = sample(c(TRUE, FALSE), 30, replace = TRUE),
                 hp = sample(c(20:50), 30, replace = TRUE))
dt <- as.data.table(df)
# How do I get the [i] to work here to subset and iterate by each factor, and also do it in data.table syntax?
fit <- lm(hp ~ cyl + factor, data = df)
The expected outcome is something like fit[1] model, fit[2] model, etc.
Recommended answer
I know you want to do this with data tables, and if you want some specific aspect of the fit, like the coefficients, then @MartinBel's approach is a good one.
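@MartinBel's answer is not reproduced on this page, so the following is only a guess at what a by-group coefficient extraction might look like in data.table syntax. It assumes the question's sample data, and it assumes every id group contains all levels of cyl and both values of factor, so that each group yields the same set of coefficients.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = rep(letters[1:3], times = 10),
                 cyl = sample(c("a", "b", "c"), 30, replace = TRUE),
                 factor = sample(c(TRUE, FALSE), 30, replace = TRUE),
                 hp = sample(20:50, 30, replace = TRUE))

# One row per id; each coefficient becomes its own column.
# .SD holds the rows of the current group (without the id column).
coefs <- dt[, as.list(coef(lm(hp ~ cyl + factor, data = .SD))), by = id]
print(coefs)
```

This keeps only the coefficients and discards the fitted model objects themselves.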
On the other hand, if you want to store the fits themselves, lapply(...) might be a better option:
library(data.table)

set.seed(1)
df <- data.frame(id = letters[1:3],
                 cyl = sample(c("a","b","c"), 30, replace = TRUE),
                 factor = sample(c(TRUE, FALSE), 30, replace = TRUE),
                 hp = sample(c(20:50), 30, replace = TRUE))
dt <- data.table(df, key = "id")
# fit one model per id; dt[J(z), ] is a keyed subset of the rows where id == z
fits <- lapply(unique(df$id),
               function(z) lm(hp ~ cyl + factor, data = dt[J(z), ], y = TRUE))
# coefficients
sapply(fits,coef)
# [,1] [,2] [,3]
# (Intercept) 44.117647 35.000000 3.933333e+01
# cylb -6.117647 -6.321429 -1.266667e+01
# cylc -13.176471 3.821429 -7.833333e+00
# factorTRUE 1.176471 5.535714 2.325797e-15
# predicted values
sapply(fits,predict)
# [,1] [,2] [,3]
# 1 45.29412 28.67857 26.66667
# 2 32.11765 35.00000 31.50000
# 3 30.94118 34.21429 26.66667
# ...
# residuals
sapply(fits,residuals)
# [,1] [,2] [,3]
# 1 2.7058824 0.3214286 7.333333
# 2 -2.1176471 5.0000000 -4.500000
# 3 3.0588235 8.7857143 -4.666667
# ...
# se and r-sq
sapply(fits, function(x)c(se=summary(x)$sigma, rsq=summary(x)$r.squared))
# [,1] [,2] [,3]
# se 7.923655 8.6358196 6.4592741
# rsq 0.463076 0.3069017 0.4957024
# Q-Q plots
par(mfrow=c(1,length(fits)))
lapply(fits,plot,2)
Note the use of key="id" in the call to data.table(...), and the use of dt[J(z)] to subset the data table. This really isn't necessary unless dt is enormous.
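For a small table, one way to skip the key is to split the table by id and fit a model on each piece. This is a minimal sketch rather than part of the original answer. It assumes a data.table version that provides the `by` argument to `split()`, and the same sample data as above.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = rep(letters[1:3], times = 10),
                 cyl = sample(c("a", "b", "c"), 30, replace = TRUE),
                 factor = sample(c(TRUE, FALSE), 30, replace = TRUE),
                 hp = sample(20:50, 30, replace = TRUE))

# split() returns a named list of data.tables, one per id value
pieces <- split(dt, by = "id")

# Fit the same model on every piece; the list keeps the id names
fits <- lapply(pieces, function(d) lm(hp ~ cyl + factor, data = d))

names(fits)        # ids of the fitted groups
coef(fits[["a"]])  # coefficients of the model for id "a"
```

Because the list is named by id, an individual model can be retrieved with fits[["a"]] or fits[[1]], which is close to the fit[1], fit[2] layout the question asks for.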