R data.table loop subset by factor and do lm()


Question


I am trying to write a function, or at least work out how to run a loop using data.table syntax, that subsets the table by a factor (here the id variable), runs a linear model on each subset, and outputs the results. Sample data below.

library(data.table)

df <- data.frame(id = letters[1:3],
                 cyl = sample(c("a", "b", "c"), 30, replace = TRUE),
                 factor = sample(c(TRUE, FALSE), 30, replace = TRUE),
                 hp = sample(20:50, 30, replace = TRUE))

dt <- as.data.table(df)

# How do I get the [i] to work here to subset and iterate by each factor,
# and also do it in data.table syntax?
fit <- lm(hp ~ cyl + factor, data = df)

The expected outcome is something like a fit[1] model, a fit[2] model, etc.

Solution

I know you want to do this with data tables, and if you want some specific aspect of the fit, like the coefficients, then @MartinBel's approach is a good one.
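
The referenced answer is not reproduced on this page, so the following is only a sketch of what a grouped, data.table-style approach could look like when just the coefficients are needed. It assumes data.table is installed. The balanced `id`, `cyl` and `factor` columns are hypothetical stand-ins for the sample data, chosen so that every `id` group contains all levels and each fit returns the same four coefficients.

```r
library(data.table)

set.seed(1)
# Hypothetical data: a balanced design so each id group sees every cyl level
# and both values of the logical column; only hp is random.
dt <- data.table(
  id     = rep(c("a", "b", "c"), each = 12),
  cyl    = rep(c("a", "b", "c"), times = 12),
  factor = rep(c(TRUE, FALSE), each = 3, times = 6),
  hp     = sample(20:50, 36, replace = TRUE)
)

# Fit one model per id; .SD holds the rows of the current group.
# as.list() spreads the named coefficient vector across columns.
coefs <- dt[, as.list(coef(lm(hp ~ cyl + factor, data = .SD))), by = id]
print(coefs)
```

The result has one row per `id` and one column per coefficient. This layout relies on every group producing the same set of coefficients; if a group is missing a level, the grouped call can fail, which is one reason to keep the full fits in a list instead.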

On the other hand, if you want to store the fits themselves, lapply(...) might be a better option:

library(data.table)

set.seed(1)
df <- data.frame(id = letters[1:3],
                 cyl = sample(c("a", "b", "c"), 30, replace = TRUE),
                 factor = sample(c(TRUE, FALSE), 30, replace = TRUE),
                 hp = sample(20:50, 30, replace = TRUE))
dt <- data.table(df, key = "id")

# one lm per id; J(z) is a keyed lookup of the rows for that id
fits <- lapply(unique(df$id),
               function(z) lm(hp ~ cyl + factor, data = dt[J(z), ], y = TRUE))
# coefficients
sapply(fits,coef)
#                   [,1]      [,2]          [,3]
# (Intercept)  44.117647 35.000000  3.933333e+01
# cylb         -6.117647 -6.321429 -1.266667e+01
# cylc        -13.176471  3.821429 -7.833333e+00
# factorTRUE    1.176471  5.535714  2.325797e-15

# predicted values
sapply(fits,predict)
#        [,1]     [,2]     [,3]
# 1  45.29412 28.67857 26.66667
# 2  32.11765 35.00000 31.50000
# 3  30.94118 34.21429 26.66667
# ...

# residuals
sapply(fits,residuals)
#           [,1]        [,2]      [,3]
# 1    2.7058824   0.3214286  7.333333
# 2   -2.1176471   5.0000000 -4.500000
# 3    3.0588235   8.7857143 -4.666667
# ...

# se and r-sq
sapply(fits, function(x)c(se=summary(x)$sigma, rsq=summary(x)$r.squared))
#         [,1]      [,2]      [,3]
# se  7.923655 8.6358196 6.4592741
# rsq 0.463076 0.3069017 0.4957024

# Q-Q plots
par(mfrow=c(1,length(fits)))
lapply(fits,plot,2)

Note the use of key="id" in the call to data.table(...), and the use of dt[J(z)] to subset the data table. This really isn't necessary unless dt is enormous; for small tables an ordinary subset such as dt[id == z] selects the same rows.
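
As a sketch of the unkeyed alternative, the table can be split by `id` and a model fitted to each piece, which also names each fit after its group. This assumes a data.table version whose `split()` method accepts a `by` argument. The balanced sample data is hypothetical and only there to keep the example reproducible.

```r
library(data.table)

set.seed(1)
# Hypothetical balanced data (only hp is random).
dt <- data.table(
  id     = rep(c("a", "b", "c"), each = 12),
  cyl    = rep(c("a", "b", "c"), times = 12),
  factor = rep(c(TRUE, FALSE), each = 3, times = 6),
  hp     = sample(20:50, 36, replace = TRUE)
)

# split() returns a named list of data.tables, one per id; no key is needed.
pieces <- split(dt, by = "id")

# One lm per piece; lapply() keeps the list names, so fits are named by id.
fits <- lapply(pieces, function(d) lm(hp ~ cyl + factor, data = d))

names(fits)        # group labels carried over from split()
coef(fits[["b"]])  # look up a single model by its id
```

Because the list is named, `fits[["a"]]` or `fits[[1]]` retrieves an individual model, and the `sapply(fits, ...)` calls above return matrices whose columns are labelled by `id` rather than by position.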
