快速分组简单线性回归 [英] Fast group-by simple linear regression
问题描述
此问与答; 如何使group_by和lm快速运行?数据框.
理论上,一系列分组回归y ~ x | g
等效于单个合并回归y ~ x * g
.后者非常吸引人,因为不同组之间的统计检验非常简单.但是实际上,进行这种较大的回归计算并不容易.我对链接的问答的回答评论包speedlm
和glm4
,但指出它们不能很好地解决此问题.
In theory, a series of group-by regression y ~ x | g
is equivalent to a single pooled regression y ~ x * g
. The latter is very appealing because statistical test between different groups is straightforward. But in practice doing this larger regression is not computationally easy. My answer on the linked Q & A reviews packages speedlm
and glm4
, but pointed out that they can't well address this problem.
大型回归问题很难解决,尤其是在存在因子变量的情况下.这可以解释为什么许多人放弃了这个想法,而是更喜欢按组划分数据并按组拟合模型.我没有必要列举分组回归的方法(请参阅线性回归和R中的分组依据).我在乎的是速度.
Large regression problem is difficult, particularly when there are factor variables. This may explain why many people abandon this idea and prefer to splitting data by group and fitting models by group. There is no point for me to enumerate methods on group-by regression (see Linear Regression and group by in R). What I care is speed.
对于像y ~ x | g
这样的简单线性回归,可以按组划分数据,然后依靠像lm
这样的标准模型拟合例程,这会降低性能.首先,子集大数据帧效率低下.其次,标准模型拟合例程遵循以下过程,这对于有用的回归计算而言是巨大的开销.
For simple linear regression as y ~ x | g
, splitting data by group then relying on standard model fitting routines like lm
is a performance killer. First of all, subsetting a large data frame is inefficient. Secondly, standard model fitting routines follow the procedure below, which are sheer overhead to the useful regression computation.
- 将模型公式解析为"terms"对象(使用
terms.formula
); - 构建模型框架(使用
model.frame.default
); - 建立模型矩阵(使用
model.matrix.default
).
- parse model formula to "terms" object (using
terms.formula
); - construct model frame (using
model.frame.default
); - build model matrix (using
model.matrix.default
).
对于简单的线性回归,有巧妙的计算技巧.正如我在数据框中变量之间的快速成对简单线性回归中所演示的那样,协方差方法非常快.我们可以通过group_by_simpleLM
函数通过简单的线性回归使它适应分组吗?
There are clever computational tricky for simple linear regression. As I demonstrated in Fast pairwise simple linear regression between variables in a data frame, the covariance method is extremely fast. Can we adapt it to group-by simple linear regression via a group_by_simpleLM
function?
推荐答案
我们必须通过编写编译后的代码来做到这一点.我会用Rcpp来介绍这个.请注意,我是C程序员,并且一直在使用R的常规C接口. Rcpp只是用于简化列表,字符串和属性的处理,并有助于在R中立即进行测试.该代码主要是用C风格编写的. R的常规C接口(如REAL
和INTEGER
)中的宏仍在使用.有关"group_by_simpleLM.cpp"的信息,请参见此答案的底部.
We have to do this by writing compiled code. I would present this with Rcpp. Be aware that I am a C programmer and have been using R's conventional C interface. Rcpp is used just to ease handling of lists, strings and attributes, as well as to facilitate immediate testing in R. The code is largely written in C-style. Macros from R's conventional C interface like REAL
and INTEGER
are still used. See the bottom of this answer for "group_by_simpleLM.cpp".
R包装函数group_by_simpleLM
具有四个参数:
The R wrapper function group_by_simpleLM
has four arguments:
group_by_simpleLM <- function (dat, LHS, RHS, group) {
##.... [TRUNCATED]
-
dat
是一个数据帧.如果您输入矩阵或列表,它将停止并抱怨. -
LHS
是一个字符向量,在~
的左侧给出了变量的名称.支持多个LHS变量. -
RHS
是一个字符向量,在~
的右侧给出了变量的名称.简单线性回归中仅允许使用一个非因素RHS变量.您可以为RHS
提供变量向量,但是该函数将仅保留第一个元素(带有警告).如果在dat
中找不到该变量(可能是因为您键入了错误的名称),或者它不是数字变量,它将为您提供信息丰富的错误消息. -
group
是一个字符向量,给出了分组变量的名称.最好是dat
中的一个因素,否则该函数将match(group, unique(group))
用于快速强制并发出警告.水平未使用的因素无害.group_by_simpleLM_cpp
看到此情况,并返回所有NaN
此类级别的东西.group
可以为NULL
,以便对所有数据进行一次回归.
dat
is a data frame. If you feed a matrix or a list, it will stop and complain.LHS
is a character vector giving the names for the variables on the left hand side of~
. Multiple LHS variables are supported.RHS
is a character vector giving the names for the variables on the right hand side of~
. Only a single, non-factor RHS variable is allowed in simple linear regression. You can provide a vector of variables toRHS
, but the function will only retain the first element (with a warning). If that variable is not found indat
(maybe because you've mistyped the name) or it is not a numerical variable, it will give you an informative error message.group
is a character vector giving the name of the grouping variable. It should best be readily a factor indat
, otherwise the function usesmatch(group, unique(group))
for a fast coercion and a warning is issued. A factor with unused levels is no harm.group_by_simpleLM_cpp
sees this and returns you allNaN
for such levels.group
can beNULL
so that a single regression is done for all data.
主力函数group_by_simpleLM_cpp
返回矩阵的命名列表,以保存每个响应的回归结果.每个矩阵都是宽"的,具有nlevels(group)
列和5行:
The workhorse function group_by_simpleLM_cpp
returns a named list of matrices to hold regression results for each response. Each matrix is "wide" with nlevels(group)
columns and 5 rows:
- "alpha"(用于拦截);
- "beta"(斜率);
- "beta.se"(用于斜率的标准误差);
- "r2"(用于R平方);
- "df.resid"(用于剩余自由度);
该函数注意组中只有一个基准的秩不足情况.无法估计斜率,然后返回NaN
.另一个特殊情况是组中只有两个数据.这样,拟合就完美了,斜率的标准误差为0.
The function watches out for rank-deficient case where there is only one datum in a group. Slope can not be estimated then and NaN
is returned. Another special case is when a group only has two data. The fit is then perfect and you get 0 standard error for the slope.
该功能在group != NULL
时是nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE)
的快速方法,在group = NULL
时是lm(RHS ~ LHS, dat)
的快速方法(甚至比
The function is a fast method for nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE)
when group != NULL
, and a fast method for lm(RHS ~ LHS, dat)
when group = NULL
(may even be faster than general_paired_simpleLM because it is written in C).
警告:
- 加权回归不被处理,因为在这种情况下协方差方法无效.
- 不检查
dat
中的NA
/NaN
/Inf
/-Inf
,并且功能在存在时中断.
- weighted regression is not handled, because the covariance method is invalid in that case.
- no check for
NA
/NaN
/Inf
/-Inf
indat
is made and functions break in their presence.
library(Rcpp)
sourceCpp("group_by_simpleLM.cpp")
## a toy dataset
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
f = gl(2, 5, labels = letters[1:2]),
g = sample(gl(2, 5, labels = LETTERS[1:2])))
分组回归:nlme::lmList
group-by regression: a fast method of nlme::lmList
group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
#$y1
# a b
#alpha 0.820107094 -2.7164723
#beta -0.009796302 0.8812007
#beta.se 0.266690568 0.2090644
#r2 0.000449565 0.8555330
#df.resid 3.000000000 3.0000000
#
#$y2
# a b
#alpha 0.1304709 0.06996587
#beta -0.1616069 -0.14685953
#beta.se 0.2465047 0.24815024
#r2 0.1253142 0.10454374
#df.resid 3.0000000 3.00000000
fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)
## results for level "a"; use `fit[[2]]` to see results for level "b"
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.820107094 0.8845125 0.92718537 0.4222195
#x -0.009796302 0.2666906 -0.03673284 0.9730056
#
#$`Response y1`$r.squared
#[1] 0.000449565
#
#
#$`Response y2`
#$`Response y2`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1304709 0.8175638 0.1595850 0.8833471
#x -0.1616069 0.2465047 -0.6555936 0.5588755
#
#$`Response y2`$r.squared
#[1] 0.1253142
处理等级缺陷而无故障
## with unused level "b"
group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
#$y1
# a b
#alpha 0.820107094 NaN
#beta -0.009796302 NaN
#beta.se 0.266690568 NaN
#r2 0.000449565 NaN
#df.resid 3.000000000 NaN
## rank-deficient case for level "b"
group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
#$y1
# a b
#alpha 0.820107094 -1.53995
#beta -0.009796302 NaN
#beta.se 0.266690568 NaN
#r2 0.000449565 NaN
#df.resid 3.000000000 0.00000
多个分组变量
当我们有多个分组变量时,group_by_simpleLM
无法直接处理它们.但是您可以使用interaction
首先创建一个单因子变量.
When we have more than one grouping variables, group_by_simpleLM
can not handle them directly. But you can use interaction
to first create a single factor variable.
dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
#$y1
# a:A b:A a:B b:B
#alpha 1.4750325 -2.7684583 -1.6393289 -1.8513669
#beta -0.2120782 0.9861509 0.7993313 0.4613999
#beta.se 0.0000000 0.2098876 0.4946167 0.0000000
#r2 1.0000000 0.9566642 0.7231188 1.0000000
#df.resid 0.0000000 1.0000000 1.0000000 0.0000000
#
#$y2
# a:A b:A a:B b:B
#alpha 1.0292956 -0.22746944 -1.5096975 0.06876360
#beta -0.2657021 -0.20650690 0.2547738 0.09172993
#beta.se 0.0000000 0.01945569 0.3483856 0.00000000
#r2 1.0000000 0.99120195 0.3484482 1.00000000
#df.resid 0.0000000 1.00000000 1.0000000 0.00000000
fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)
## note that the first group a:A only has two values, so df.resid = 0
## my method returns 0 standard error for the slope
## but lm or lmList would return NaN
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.4750325 NaN NaN NaN
#x -0.2120782 NaN NaN NaN
#
#$`Response y1`$r.squared
#[1] 1
#
#
#$`Response y2`
#$`Response y2`$coefficients
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0292956 NaN NaN NaN
#x -0.2657021 NaN NaN NaN
#
#$`Response y2`$r.squared
#[1] 1
不分组:lm
no grouping: a fast method of lm
group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
#$y1
# ALL
#alpha -0.9481826
#beta 0.4357022
#beta.se 0.2408162
#r2 0.2903691
#df.resid 8.0000000
#
#$y2
# ALL
#alpha 0.1002184
#beta -0.1542332
#beta.se 0.1514935
#r2 0.1147012
#df.resid 8.0000000
快速的大型简单线性回归
set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
day = 3:7,
y1 = runif(nr),
y2 = rpois(nr, 3),
y3 = rnorm(nr),
y4 = rnorm(nr, 1, 5))
system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
# user system elapsed
# 0.200 0.016 0.219
library(MatrixModels)
system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
# user system elapsed
# 9.012 0.172 9.266
group_by_simpleLM
几乎立即执行所有4个响应,而glm4
仅一个响应就需要9s!
group_by_simpleLM
does all 4 responses almost instantly, while glm4
needs 9s for one response alone!
请注意,glm4
在排名不足的情况下可能会崩溃,而group_by_simpleLM
则不会.
Note that glm4
can break down in rank-deficient case, while group_by_simpleLM
would not.
附录:"group_by_simpleLM.cpp"
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {
/* number of data and number of responses */
int n = x.size(), k = Y.size(), n_groups = group_levels.size();
/* set up result list */
List result(k);
List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
int j; for (j = 0; j < k; j++) {
NumericMatrix mat(5, n_groups);
mat.attr("dimnames") = dimnames;
result[j] = mat;
}
result.attr("names") = Y.attr("names");
/* set up a vector to hold sample size for each group */
size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));
/*
compute group offset: cumsum(group_offset)
The offset is used in a different way when group is sorted or unsorted
In the former case, it is the offset to real x, y values;
In the latter case, it is the offset to ordering index indx
*/
int *u = INTEGER(group), *u_end = u + n, i;
if (n_groups > 1) {
while (u < u_end) group_offset[*u++]++;
for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
} else {
group_offset[1] = n;
group_unsorted = 0;
}
/* local variables & pointers */
double *xi, *xi_end; /* pointer to the 1st and the last x value */
double *yi; /* pointer to the first y value */
int gi; double inv_gi; /* sample size of the i-th group */
double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
double xiyi_cov; /* covariance between x and y values in the i-th group */
double beta, r2; int df_resi;
double *matij;
/* additional storage and variables when group is unsorted */
int *indx; double *xb, *xbi, dtmp;
if (group_unsorted) {
indx = (int *)malloc(n * sizeof(int));
xb = (double *)malloc(n * sizeof(double)); // buffer x for caching
R_orderVector1(indx, n, group, TRUE, FALSE); // Er, how is TRUE & FALSE recogonized as Rboolean?
}
/* loop through groups */
for (i = 0; i < n_groups; i++) {
/* set group size gi */
gi = group_offset[i + 1] - group_offset[i];
/* special case for a factor level with no data */
if (gi == 0) {
for (j = 0; j < k; j++) {
/* matrix column for write-back */
matij = REAL(result[j]) + i * 5;
matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
matij[3] = R_NaN; matij[4] = R_NaN;
}
continue;
}
/* rank-deficient case */
if (gi == 1) {
gi = group_offset[i];
if (group_unsorted) gi = indx[gi];
for (j = 0; j < k; j++) {
/* matrix column for write-back */
matij = REAL(result[j]) + i * 5;
matij[0] = REAL(Y[j])[gi];
matij[1] = R_NaN; matij[2] = R_NaN;
matij[3] = R_NaN; matij[4] = 0.0;
}
continue;
}
/* general case where a regression line can be estimated */
inv_gi = 1 / (double)gi;
/* compute mean & variance of x values in this group */
xi_mean = 0.0; xi_var = 0.0;
if (group_unsorted) {
/* use u, u_end and xbi */
xi = REAL(x);
u = indx + group_offset[i]; /* offset acts on index */
u_end = u + gi;
xbi = xb + group_offset[i];
for (; u < u_end; xbi++, u++) {
dtmp = xi[*u];
xi_mean += dtmp;
xi_var += dtmp * dtmp;
*xbi = dtmp;
}
} else {
/* use xi and xi_end */
xi = REAL(x) + group_offset[i]; /* offset acts on values */
xi_end = xi + gi;
for (; xi < xi_end; xi++) {
xi_mean += *xi;
xi_var += (*xi) * (*xi);
}
}
xi_mean = xi_mean * inv_gi;
xi_var = xi_var * inv_gi - xi_mean * xi_mean;
/* loop through responses doing simple linear regression */
for (j = 0; j < k; j++) {
/* compute mean & variance of y values, as well its covariance with x values */
yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
if (group_unsorted) {
xbi = xb + group_offset[i]; /* use buffered x values */
yi = REAL(Y[j]);
u = indx + group_offset[i]; /* offset acts on index */
for (; u < u_end; u++, xbi++) {
dtmp = yi[*u];
yi_mean += dtmp;
yi_var += dtmp * dtmp;
xiyi_cov += dtmp * (*xbi);
}
} else {
/* set xi and yi */
xi = REAL(x) + group_offset[i]; /* offset acts on values */
yi = REAL(Y[j]) + group_offset[i]; /* offset acts on values */
for (; xi < xi_end; xi++, yi++) {
yi_mean += *yi;
yi_var += (*yi) * (*yi);
xiyi_cov += (*yi) * (*xi);
}
}
yi_mean = yi_mean * inv_gi;
yi_var = yi_var * inv_gi - yi_mean * yi_mean;
xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
/* locate the right place to write back regression result */
matij = REAL(result[j]) + i * 5 + 4;
/* residual degree of freedom */
df_resi = gi - 2; *matij-- = (double)df_resi;
/* R-squared = squared correlation */
r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
/* standard error of regression slope */
if (df_resi == 0) *matij-- = 0.0;
else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
/* regression slope */
beta = xiyi_cov / xi_var; *matij-- = beta;
/* regression intercept */
*matij = yi_mean - beta * xi_mean;
}
}
if (group_unsorted) {
free(indx);
free(xb);
}
free(group_offset);
return result;
}
/*** R
group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {
## basic input validation
if (missing(dat)) stop("no data provided to 'dat'!")
if (!is.data.frame(dat)) stop("'dat' must be a data frame!")
if (missing(LHS)) stop("no 'LHS' provided!")
if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")
if (missing(RHS)) stop("no 'RHS' provided!")
if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")
if (!is.null(group)) {
## grouping variable provided: a fast method of `nlme::lmList`
if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")
## ensure that group has length 1, is available in the data frame and is a factor
if (length(group) > 1L) {
warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
group <- group[1L]
}
grp <- dat[[group]]
if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))
if (is.factor(grp)) {
grp_levels <- levels(grp)
} else {
warning("grouping variable is not provided as a factor; fast coercion is made!")
grp_levels <- unique(grp)
grp <- match(grp, grp_levels)
grp_levels <- as.character(grp_levels)
}
grp_unsorted <- .Internal(is.unsorted(grp, FALSE))
} else {
## no grouping; a fast method of `lm`
grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE
}
## the RHS must has length 1, is available in the data frame and is numeric
if (length(RHS) > 1L) {
warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
RHS <- RHS[1L]
}
x <- dat[[RHS]]
if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
if (!is.numeric(x) || is.factor(x)) {
stop("RHS variable must be 'numeric' for simple linear regression!")
}
x < as.numeric(x) ## just in case that `x` is an integer
## check LHS variables
nested <- match(RHS, LHS, nomatch = 0L)
if (nested > 0L) {
warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
LHS <- LHS[-nested]
}
if (length(LHS) == 0L) stop("no usable LHS variables found!")
missed <- !(LHS %in% names(dat))
if (any(missed)) {
warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
LHS <- LHS[!missed]
}
if (length(LHS) == 0L) stop("no usable LHS variables found!")
Y <- dat[LHS]
invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
if (any(invalid_LHS)) {
warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
Y <- Y[!invalid_LHS]
}
if (length(Y) == 0L) stop("no usable LHS variables found!")
Y <- lapply(Y, as.numeric) ## just in case that we have integer variables in Y
## check for exsitence of NA, NaN, Inf and -Inf and drop them?
## use Rcpp
group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
}
*/
这篇关于快速分组简单线性回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!