基于子组的线性回归分析 [英] Analysis using linear regression based on subgroups

查看:88
本文介绍了基于子组的线性回归分析的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我有数据(t,y),在这里我期望线性相关性y(t).此外,每个观察值par1, par2, par3都有属性.是否有一种算法或技术可以确定(一个或两个参数或所有参数)与拟合是否相关?我尝试了leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10),但无法获得最合适的公式.

Assume I have data (t,y), where I expect a linear dependency y(t). Furthermore, there exist attributes to each observation par1, par2, par3. Is there an algorithm or technique to decide, if (one or both or all of the parameters) are relevant for the fit or not? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10) but was not able to get the formula for the best fit.

最终结果在绘制时应如下所示.有关数据,请参见下文.

The final result should look like this if plotted. For data see below.


因此,我想要信息


Thus, I want the information

  • 添加par1par2最适合
  • 模型是y_i = a_i * t_i + b_i,具有给定的a_ib_i
  • Adding par1and par2 gives the best fit
  • The models are y_i = a_i * t_i + b_i with given a_i and b_i

可复制的示例:

t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(1, 0.3, 0.2) # slope
b <- c(-0.5, 0.5, 0.1) # offset

# create t_i, y_ti and y_i (including noise)
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33*i)
  d[[i]] <- sort(sample(t, 50, replace = F))
  set.seed(33*i)
  noise <- rnorm(10)
  y[[i]] <- a[i]*d[[i]] + b[i] + noise
  y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], par1=rep(1), par2=rep(10), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], par1=rep(2), par2=rep(20), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], par1=rep(2), par2=rep(30), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ]

# That is what the data is looking like:
plot(mydata$t, mydata$y)

# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
     main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]],col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")

对@Roland答案的评论和问题:

我了解到,使用给定的三个参数,有2^3=8个组具有2*3*3=18个因子水平.但是我希望我们只有8个相关的组,因为我总是可以在是否包含参数x"之间进行选择.对我来说,仅包括参数y的级别x"是没有意义的.

I understand that that with the given three parameters there are 2^3=8 groups with 2*3*3=18 factor levels. But I would expect that we only have 8 relevant groups as I always have the choice between "include parameter x or not". To me it does not make sense to only "include level x of parameter y".

我尝试了以下

g <- 0
t_lin1 <- mydata$t[mydata$g == g]
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red")
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16)

并意识到合身性已关闭.回想起来很清楚,因为

and realized that the fit is off. Looking back this is clear because

  • 我输入了错误的因子水平(很可能参数3不相关)
  • 并因此获得适合的错误数据

所以我的最后一个问题是:

So my last question is:

  • 我在哪里可以找到最佳模型中包含的相关组
  • 回归分析中对应的拟合参数是什么?
  • Where can I find the relevant groups included in the best model and
  • what are the corresponding fit parameters from the regression?

对不起,如果这很明显,但是对我来说是个谜

Sorry, if this was obvious but to me it is mystery

推荐答案

LASSO可能非常接近(尽管它标识了太多的影响):

The LASSO can come pretty close (although it identifies still too many effects):

#I assume these are supposed to be factors:
mydata$par1 <- factor(mydata$par1)
mydata$par2 <- factor(mydata$par2)
mydata$par3 <- factor(mydata$par3)

#create model matrix, remove intercept since glmnet adds it
x <- model.matrix(y ~ (par1 * par2 * par3) * t, data = mydata)[,-1]

#cross-validated LASSO
library(glmnet)
set.seed(42)
fit <- cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
plot(fit)

coef <- as.matrix(coef(fit, s = "lambda.1se"))
coef[coef != 0,]
#(Intercept)      par230           t     par12:t    par230:t   par3300:t 
# 0.47542479 -0.27612966  0.75497711 -0.42493030 -0.15044371  0.03033057

#The groups:
mydata$g <- factor((mydata$par2 == 30) + 10 * (mydata$par1 == 2) + 100 * (mydata$par3 == 300))



mydata$pred.1se <- predict(fit, newx = x, s = "lambda.1se")

library(ggplot2)
ggplot(mydata, aes(x = t, color = g)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred.1se))

然后您可以从系数中计算出所需的截距和斜率.

You can then calculate the desired intercepts and slopes from the coefficients.

这篇关于基于子组的线性回归分析的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆