Analysis using linear regression based on subgroups
Problem Description
Assume I have data (t, y) for which I expect a linear dependency y(t). Furthermore, each observation carries the attributes par1, par2, and par3. Is there an algorithm or technique to decide whether one, two, or all of these parameters are relevant for the fit? I tried leaps::regsubsets(y ~ t + par1 + par2 + par3, data = mydata, nbest = 10), but was not able to extract the formula for the best fit.
The final result should look like this if plotted. For data see below.
Thus, I want the following information:
- Adding par1 and par2 gives the best fit
- The models are y_i = a_i * t_i + b_i with given a_i and b_i
Reproducible example:
t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(1, 0.3, 0.2) # slope
b <- c(-0.5, 0.5, 0.1) # offset
# create t_i, y_ti and y_i (including noise)
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
set.seed(33*i)
d[[i]] <- sort(sample(t, 50, replace = F))
set.seed(33*i)
noise <- rnorm(10) # note: length 10, so it is recycled across the 50 sampled points
y[[i]] <- a[i]*d[[i]] + b[i] + noise
y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], par1=rep(1), par2=rep(10), par3=sample(c(100, 200, 300), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], par1=rep(2), par2=rep(20), par3=sample(c(100, 200, 300), length(d[[2]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], par1=rep(2), par2=rep(30), par3=sample(c(100, 200, 300), length(d[[3]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata <- mydata[sample(nrow(mydata)), ]
# That is what the data is looking like:
plot(mydata$t, mydata$y)
# This is the result I am looking for (ideally):
plot(d[[1]], y[[1]], col = "black", xlim = c(0, 10), ylim = c(-2, 10), xlab = "t", ylab = "y",
     main = "Fit for three different groups")
points(d[[2]], y[[2]], col = "red")
points(d[[3]], y[[3]], col = "blue")
lines(d[[1]], y_t[[1]],col = "black")
lines(d[[2]], y_t[[2]], col = "red")
lines(d[[3]], y_t[[3]], col = "blue")
Comment and questions on @Roland's answer:
I understand that with the given three parameters there are 2^3 = 8 groups with 2*3*3 = 18 factor levels. But I would expect that we only have 8 relevant groups, as I always have the choice between "include parameter x or not". To me it does not make sense to only "include level x of parameter y".
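As an illustration of that reading, here is a rough sketch that fits one plain lm() per "include parameter x or not" choice (every included parameter treated as a grouping factor via interaction()) and compares the eight candidates by BIC; par_sets and bics are made-up helper names:
# the 2^3 = 8 possible parameter subsets
par_sets <- list(character(0), "par1", "par2", "par3",
                 c("par1", "par2"), c("par1", "par3"), c("par2", "par3"),
                 c("par1", "par2", "par3"))
bics <- sapply(par_sets, function(p) {
  rhs <- if (length(p) == 0) "t" else
    paste0("t * interaction(", paste(p, collapse = ", "), ")")
  BIC(lm(as.formula(paste("y ~", rhs)), data = mydata))
})
par_sets[[which.min(bics)]]  # parameter set of the lowest-BIC model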
I tried the following:
g <- 0
t_lin1 <- mydata$t[mydata$g == g]
y_lin1 <- mydata$y[mydata$g == g]
plot(mydata$t, mydata$y)
points(t_lin1, y_lin1, col = "red")
abline(lm(y_lin1 ~ t_lin1), col = "red")
points(pred.1se ~ t, data = mydata, col = as.integer(mydata$g), pch = 16)
and realized that the fit is off. Looking back, this is clear because:
- I entered the wrong factor level (most likely par3 is not relevant)
- and therefore selected the wrong data for the fit
So my last question is:
- Where can I find the relevant groups included in the best model and
- what are the corresponding fit parameters from the regression?
Sorry if this was obvious, but to me it is a mystery.
Answer
The LASSO can come pretty close (although it still identifies too many effects):
#I assume these are supposed to be factors:
mydata$par1 <- factor(mydata$par1)
mydata$par2 <- factor(mydata$par2)
mydata$par3 <- factor(mydata$par3)
#create model matrix, remove intercept since glmnet adds it
x <- model.matrix(y ~ (par1 * par2 * par3) * t, data = mydata)[,-1]
#cross-validated LASSO
library(glmnet)
set.seed(42)
fit <- cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
plot(fit)
coef <- as.matrix(coef(fit, s = "lambda.1se"))
coef[coef != 0,]
#(Intercept) par230 t par12:t par230:t par3300:t
# 0.47542479 -0.27612966 0.75497711 -0.42493030 -0.15044371 0.03033057
#The groups:
mydata$g <- factor((mydata$par2 == 30) + 10 * (mydata$par1 == 2) + 100 * (mydata$par3 == 300))
mydata$pred.1se <- predict(fit, newx = x, s = "lambda.1se")
library(ggplot2)
ggplot(mydata, aes(x = t, color = g)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred.1se))
You can then calculate the desired intercepts and slopes from the coefficients.
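For example, a small sketch of that calculation for the par1 == 2, par2 == 30 group, using the non-zero coefficients printed above and ignoring the tiny par3300:t term (cf, a_g and b_g are just helper names):
cf <- coef[coef != 0, ]                          # named vector of the non-zero coefficients
b_g <- cf["(Intercept)"] + cf["par230"]          # intercept of that group
a_g <- cf["t"] + cf["par12:t"] + cf["par230:t"]  # slope of that group
c(intercept = unname(b_g), slope = unname(a_g))
# roughly 0.20 and 0.18; the true values for this group are b = 0.1 and a = 0.2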