How can I extract the number of lines and the corresponding equations from a linear fit
Question
I have data and I expect several linear correlations of the form
y_i = a_i + b_i * t_i, i = 1 .. N
where N is a priori unknown. The short version of the question is: given a fit,

- How can I extract N?
- How can I extract the equations?
In the reproducible example below, I have data (t, y) with corresponding parameters p1 (levels p1_1, p1_2) and p2 (levels p2_1, p2_2, p2_3). Thus the data looks like (t, y, p1, p2), which has at most 2*3 different best-fit lines, and the linear fit then has at most 2*2*3 non-zero coefficients.
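The counts quoted above follow directly from the number of factor levels; a quick sketch of the arithmetic (in Python for illustration):

```python
# p1 has 2 levels, p2 has 3 levels
n_p1_levels, n_p2_levels = 2, 3

# One candidate best-fit line per observed (p1, p2) cell
max_lines = n_p1_levels * n_p2_levels

# The full interaction model y ~ t * p1 * p2 allows a separate
# intercept and slope for every cell: 2 parameters per line
max_nonzero_coefs = 2 * n_p1_levels * n_p2_levels
```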
I run into the following problems: Assume I have the three equations
y1 = 5 + 3*t (for p1=p1_1, p2=p2_1)
y2 = 3 + t (for p1=p1_2, p2=p2_2)
y3 = 1 - t (for p1=p1_2, p2=p2_3)
Running cv.glmnet(y ~ t * p1 * p2, ...) yields
(Intercept) 5
t 3 => y1 = 5 + 3t
p1p1_2 -2 => y2 = 3 + 3t?
p2p2_2 .
p2p2_3 -2 => y3 = 1 + 3t?
t:p1p1_2 -2 => y4 = 3 + t (or y4 = 1 + t?)
t:p2p2_2 .
t:p2p2_3 -2 => y5 = 1 - t
p1p1_2:p2p2_2 .
p1p1_2:p2p2_3 -0.1 => y6 = 0.9 - t?
t:p1p1_2:p2p2_2 .
t:p1p1_2:p2p2_3 .
Desired result: the program should suggest 4 equations: y1, a corrected y4, y5, and y6; hopefully there is a good reason (which one?) to ignore y6.
(Intercept) 5
t 3 => y1 = 5 + 3t
p1p1_2 -4 => y2 = 1 + 3t?
p2p2_2 2 => y3 = 3 + 3t
p2p2_3 .
t:p1p1_2 -4 => y5 = 1 - x (or y4 = 3 - t?)
t:p2p2_2 2 => y6 = 3 + t?
t:p2p2_3 .
p1p1_2:p2p2_2 .
p1p1_2:p2p2_3 .
t:p1p1_2:p2p2_2 .
t:p1p1_2:p2p2_3 .
Desired result: the program should suggest 3 equations: y1, y3, and y6.
Do I overlook something obvious?
Column three is a dummy factor containing noise. This column is not considered, for simplicity.
# Create testdata
sigma <- 0.5
t <- seq(0,10, length.out = 1000) # large sample of x values
# Create 3 linear equations of the form y_i = a*t_i + b
a <- c(3, 1, -1) # slope
b <- c(5, 3, 1) # offset
# create t_i, y_ti (theory) and y_i (including noise)
d <- list()
y <- list()
y_t <- list()
for (i in 1:3) {
  set.seed(33*i)
  d[[i]] <- sort(sample(t, 50, replace = F))
  set.seed(33*i)
  noise <- rnorm(length(d[[i]]), 0, sigma)  # one noise value per sampled point (avoids recycling)
  y[[i]] <- a[i]*d[[i]] + b[i] + noise
  y_t[[i]] <- a[i]*d[[i]] + b[i]
}
# Final data set
df1 <- data.frame(t=d[[1]], y=y[[1]], p1=rep("p1_1"), p2=rep("p2_1"),
                  p3=sample(c("p3_1", "p3_2", "p3_3"), length(d[[1]]), replace = T))
df2 <- data.frame(t=d[[2]], y=y[[2]], p1=rep("p1_2"), p2=rep("p2_2"),
                  p3=sample(c("p3_1", "p3_2", "p3_3"), length(d[[2]]), replace = T))
df3 <- data.frame(t=d[[3]], y=y[[3]], p1=rep("p1_2"), p2=rep("p2_3"),
                  p3=sample(c("p3_1", "p3_2", "p3_3"), length(d[[3]]), replace = T))
mydata <- rbind(df1, df2, df3)
mydata$p1 <- factor(mydata$p1)
mydata$p2 <- factor(mydata$p2)
mydata$p3 <- factor(mydata$p3)
mydata <- mydata[sample(nrow(mydata)), ]
# What the raw data looks like:
plot(x = mydata$t, y = mydata$y)
cols <- rainbow(length(levels(mydata$p1))*length(levels(mydata$p2))*length(levels(mydata$p3)))
rm(.Random.seed, envir=.GlobalEnv)
cols <- sample(cols) # most likely similar colors are not next to each other;-)
# Fit using lm disabled - just uncomment and comment the part below
# fit <- lm(y ~ t * p1 * p2, data = mydata)
# coef <- as.matrix(fit$coefficients)
# mydata$pred <- predict(fit)
# Fit using glmnet
set.seed(42)
fit_type <- c("lambda.min", "lambda.1se")[1]
x <- model.matrix(y ~ t * p1 * p2, data = mydata)[,-1]
fit <- glmnet::cv.glmnet(x, mydata$y, intercept = TRUE, nfolds = 10, alpha = 1)
coef <- coef(fit, s = fit_type)  # coef() has no newx argument; coef.cv.glmnet is not exported
mydata$pred <- predict(fit, newx = x, s = fit_type)
# plots
plot(d[[1]], y_t[[1]], type = "l", lty = 3, col = "black", main = "Raw data",
     xlim = c(0, 10), ylim = c(min(mydata$y), max(mydata$y)), xlab = "t", ylab = "y")
lines(d[[2]], y_t[[2]], col = "black", lty = 3)
lines(d[[3]], y_t[[3]], col = "black", lty = 3)
# The following for loops are fixed right now. In the end this should be automated using
# the input from the fit (and the knowledge how to extract N and the lines above).
pn <- 0
for (p1 in 1:length(levels(mydata$p1))) {
  for (p2 in 1:length(levels(mydata$p2))) {
    pn <- pn + 1
    tmp <- mydata[mydata$p1 == levels(mydata$p1)[p1] & mydata$p2 == levels(mydata$p2)[p2], ]
    points(x = tmp$t, y = tmp$y, col = cols[pn])              # original data
    points(x = tmp$t, y = tmp$pred, col = cols[pn], pch = 3)  # estimated data from predict
    if (length(tmp$pred) > 0) {
      abline(lm(tmp$pred ~ tmp$t), col = cols[pn])
    }
  }
}
Related posts:

- linear regression based on subgroups: shows how to use multilevel analysis. For me it still does not explain how to obtain the best-fit lines. The ggplot2 plot displays 6 of them, but to me this is a mystery. Please note that I use a different set of test data which is much easier to interpret (lines well separated, less noise, integer a and b).
- different levels with different colors: explains how to display the lines if the number of lines is known and all levels are relevant.
Answer
I think you are misinterpreting the regression results. If an equation contains the terms p1_m and p2_n, then it must also contain the interactions t:p1_m and t:p2_n; it cannot contain one without the other. In the sample data there are three pairs of coefficients:
> unique(mydata[,3:4])
# p1 p2
# 96 p1_2 p2_2
# 1 p1_1 p2_1
# 135 p1_2 p2_3
Looking at the lm results, we reconstruct the equations as:
- y = 5 + 3t + p1p1_2 + (t:p1p1_2)*t + p2p2_2 + (t:p2p2_2)*t = 3 + t;
- y = 5 + 3t + p1p1_1 + (t:p1p1_1)*t + p2p2_1 + (t:p2p2_1)*t = 5 + 3t;
- y = 5 + 3t + p1p1_2 + (t:p1p1_2)*t + p2p2_3 + (t:p2p2_3)*t = 1 - t.
These match the equations that you specify at the beginning, so there is no ambiguity.
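This reconstruction rule can be automated: for each observed (p1, p2) cell, sum every coefficient whose dummy factors are all active for that cell, putting terms with a leading t into the slope and all others into the intercept. A sketch of the bookkeeping in Python (the document's code is R, but the logic is language-independent), using hypothetical exact coefficients in R's contrast naming — real glmnet output will deviate slightly due to noise and shrinkage:

```python
# Hypothetical exact coefficients, named as in R's contrast coding (assumption:
# a fit on noise-free data; cv.glmnet output will be close but not exact).
coefs = {
    "(Intercept)": 5, "t": 3,
    "p1p1_2": -2, "p2p2_2": 0, "p2p2_3": -2,
    "t:p1p1_2": -2, "t:p2p2_2": 0, "t:p2p2_3": -2,
    "p1p1_2:p2p2_2": 0, "p1p1_2:p2p2_3": 0,
    "t:p1p1_2:p2p2_2": 0, "t:p1p1_2:p2p2_3": 0,
}

def equation(p1, p2):
    """Return (intercept, slope) for one (p1, p2) cell by summing every
    coefficient whose dummy terms are all active for that cell."""
    active = {f"p1{p1}", f"p2{p2}"}          # dummies switched on for this cell
    intercept = coefs["(Intercept)"]
    slope = coefs["t"]
    for name, value in coefs.items():
        if name in ("(Intercept)", "t"):
            continue
        parts = name.split(":")
        if parts[0] == "t":                  # slope contribution (t:... term)
            if all(p in active for p in parts[1:]):
                slope += value
        else:                                # intercept contribution
            if all(p in active for p in parts):
                intercept += value
    return intercept, slope

# Only the cells that actually occur in the data (cf. unique(mydata[,3:4]))
groups = [("p1_1", "p2_1"), ("p1_2", "p2_2"), ("p1_2", "p2_3")]
lines = {g: equation(*g) for g in groups}
# N is the number of distinct (intercept, slope) pairs among observed cells
N = len(set(lines.values()))
```

N here is 3, and the three (intercept, slope) pairs are exactly the equations reconstructed above.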