快速分组简单线性回归 [英] Fast group-by simple linear regression

查看:174
本文介绍了快速分组简单线性回归的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问与答; 如何使group_by和lm快速运行?数据框.

理论上,一系列分组回归y ~ x | g等效于单个合并回归y ~ x * g.后者非常吸引人,因为不同组之间的统计检验非常简单.但是实际上,进行这种较大的回归计算并不容易.我对链接的问答的回答评论包speedlmglm4,但指出它们不能很好地解决此问题.

In theory, a series of group-by regression y ~ x | g is equivalent to a single pooled regression y ~ x * g. The latter is very appealing because statistical test between different groups is straightforward. But in practice doing this larger regression is not computationally easy. My answer on the linked Q & A reviews packages speedlm and glm4, but pointed out that they can't well address this problem.

大型回归问题很难解决,尤其是在存在因子变量的情况下.这可以解释为什么许多人放弃了这个想法,而是更喜欢按组划分数据并按组拟合模型.我没有必要列举分组回归的方法(请参阅线性回归和R中的分组依据).我在乎的是速度.

Large regression problem is difficult, particularly when there are factor variables. This may explain why many people abandon this idea and prefer to splitting data by group and fitting models by group. There is no point for me to enumerate methods on group-by regression (see Linear Regression and group by in R). What I care is speed.

对于像y ~ x | g这样的简单线性回归,可以按组划分数据,然后依靠像lm这样的标准模型拟合例程,这会降低性能.首先,子集大数据帧效率低下.其次,标准模型拟合例程遵循以下过程,这对于有用的回归计算而言是巨大的开销.

For simple linear regression as y ~ x | g, splitting data by group then relying on standard model fitting routines like lm is a performance killer. First of all, subsetting a large data frame is inefficient. Secondly, standard model fitting routines follow the procedure below, which are sheer overhead to the useful regression computation.

  1. 将模型公式解析为"terms"对象(使用terms.formula);
  2. 构建模型框架(使用model.frame.default);
  3. 建立模型矩阵(使用model.matrix.default).
  1. parse model formula to "terms" object (using terms.formula);
  2. construct model frame (using model.frame.default);
  3. build model matrix (using model.matrix.default).

对于简单的线性回归,有巧妙的计算技巧.正如我在数据框中变量之间的快速成对简单线性回归中所演示的那样,协方差方法非常快.我们可以通过group_by_simpleLM函数通过简单的线性回归使它适应分组吗?

There are clever computational tricky for simple linear regression. As I demonstrated in Fast pairwise simple linear regression between variables in a data frame, the covariance method is extremely fast. Can we adapt it to group-by simple linear regression via a group_by_simpleLM function?

推荐答案

我们必须通过编写编译后的代码来做到这一点.我会用Rcpp来介绍这个.请注意,我是C程序员,并且一直在使用R的常规C接口. Rcpp只是用于简化列表,字符串和属性的处理,并有助于在R中立即进行测试.该代码主要是用C风格编写的. R的常规C接口(如REALINTEGER)中的宏仍在使用.有关"group_by_simpleLM.cpp"的信息,请参见此答案的底部.

We have to do this by writing compiled code. I would present this with Rcpp. Be aware that I am a C programmer and have been using R's conventional C interface. Rcpp is used just to ease handling of lists, strings and attributes, as well as to facilitate immediate testing in R. The code is largely written in C-style. Macros from R's conventional C interface like REAL and INTEGER are still used. See the bottom of this answer for "group_by_simpleLM.cpp".

R包装函数group_by_simpleLM具有四个参数:

The R wrapper function group_by_simpleLM has four arguments:

group_by_simpleLM <- function (dat, LHS, RHS, group) {
##.... [TRUNCATED]

  • dat是一个数据帧.如果您输入矩阵或列表,它将停止并抱怨.
  • LHS是一个字符向量,在~的左侧给出了变量的名称.支持多个LHS变量.
  • RHS是一个字符向量,在~的右侧给出了变量的名称.简单线性回归中仅允许使用一个非因素RHS变量.您可以为RHS提供变量向量,但是该函数将仅保留第一个元素(带有警告).如果在dat中找不到该变量(可能是因为您键入了错误的名称),或者它不是数字变量,它将为您提供信息丰富的错误消息.
  • group是一个字符向量,给出了分组变量的名称.最好是dat中的一个因素,否则该函数将match(group, unique(group))用于快速强制并发出警告.水平未使用的因素无害. group_by_simpleLM_cpp看到此情况,并返回所有NaN此类级别的东西. group可以为NULL,以便对所有数据进行一次回归.
  • dat is a data frame. If you feed a matrix or a list, it will stop and complain.
  • LHS is a character vector giving the names for the variables on the left hand side of ~. Multiple LHS variables are supported.
  • RHS is a character vector giving the names for the variables on the right hand side of ~. Only a single, non-factor RHS variable is allowed in simple linear regression. You can provide a vector of variables to RHS, but the function will only retain the first element (with a warning). If that variable is not found in dat (maybe because you've mistyped the name) or it is not a numerical variable, it will give you an informative error message.
  • group is a character vector giving the name of the grouping variable. It should best be readily a factor in dat, otherwise the function uses match(group, unique(group)) for a fast coercion and a warning is issued. A factor with unused levels is no harm. group_by_simpleLM_cpp sees this and returns you all NaN for such levels. group can be NULL so that a single regression is done for all data.

主力函数group_by_simpleLM_cpp返回矩阵的命名列表,以保存每个响应的回归结果.每个矩阵都是宽"的,具有nlevels(group)列和5行:

The workhorse function group_by_simpleLM_cpp returns a named list of matrices to hold regression results for each response. Each matrix is "wide" with nlevels(group) columns and 5 rows:

  • "alpha"(用于拦截);
  • "beta"(斜率);
  • "beta.se"(用于斜率的标准误差);
  • "r2"(用于R平方);
  • "df.resid"(用于剩余自由度);

对于简单的线性回归,这五个统计数据足以获取其他统计数据.

该函数注意组中只有一个基准的秩不足情况.无法估计斜率,然后返回NaN.另一个特殊情况是组中只有两个数据.这样,拟合就完美了,斜率的标准误差为0.

The function watches out for rank-deficient case where there is only one datum in a group. Slope can not be estimated then and NaN is returned. Another special case is when a group only has two data. The fit is then perfect and you get 0 standard error for the slope.

该功能在group != NULL时是nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE)的快速方法,在group = NULL时是lm(RHS ~ LHS, dat)的快速方法(甚至比

The function is a fast method for nlme::lmList(RHS ~ LHS | group, dat, pool = FALSE) when group != NULL, and a fast method for lm(RHS ~ LHS, dat) when group = NULL (may even be faster than general_paired_simpleLM because it is written in C).

警告:

  • 加权回归不被处理,因为在这种情况下协方差方法无效.
  • 不检查dat中的NA/NaN/Inf/-Inf,并且功能在存在时中断.
  • weighted regression is not handled, because the covariance method is invalid in that case.
  • no check for NA / NaN / Inf / -Inf in dat is made and functions break in their presence.
library(Rcpp)
sourceCpp("group_by_simpleLM.cpp")

## a toy dataset
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
                  f = gl(2, 5, labels = letters[1:2]),
                  g = sample(gl(2, 5, labels = LETTERS[1:2])))

分组回归:nlme::lmList

group-by regression: a fast method of nlme::lmList

group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
#$y1
#                    a          b
#alpha     0.820107094 -2.7164723
#beta     -0.009796302  0.8812007
#beta.se   0.266690568  0.2090644
#r2        0.000449565  0.8555330
#df.resid  3.000000000  3.0000000
#
#$y2
#                  a           b
#alpha     0.1304709  0.06996587
#beta     -0.1616069 -0.14685953
#beta.se   0.2465047  0.24815024
#r2        0.1253142  0.10454374
#df.resid  3.0000000  3.00000000

fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)

## results for level "a"; use `fit[[2]]` to see results for level "b"
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
#                Estimate Std. Error     t value  Pr(>|t|)
#(Intercept)  0.820107094  0.8845125  0.92718537 0.4222195
#x           -0.009796302  0.2666906 -0.03673284 0.9730056
#
#$`Response y1`$r.squared
#[1] 0.000449565
#
#
#$`Response y2`
#$`Response y2`$coefficients
#              Estimate Std. Error    t value  Pr(>|t|)
#(Intercept)  0.1304709  0.8175638  0.1595850 0.8833471
#x           -0.1616069  0.2465047 -0.6555936 0.5588755
#
#$`Response y2`$r.squared
#[1] 0.1253142

处理等级缺陷而无故障

## with unused level "b"
group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
#$y1
#                    a   b
#alpha     0.820107094 NaN
#beta     -0.009796302 NaN
#beta.se   0.266690568 NaN
#r2        0.000449565 NaN
#df.resid  3.000000000 NaN

## rank-deficient case for level "b"
group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
#$y1
#                    a        b
#alpha     0.820107094 -1.53995
#beta     -0.009796302      NaN
#beta.se   0.266690568      NaN
#r2        0.000449565      NaN
#df.resid  3.000000000  0.00000

多个分组变量

当我们有多个分组变量时,group_by_simpleLM无法直接处理它们.但是您可以使用interaction首先创建一个单因子变量.

When we have more than one grouping variables, group_by_simpleLM can not handle them directly. But you can use interaction to first create a single factor variable.

dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
#$y1
#                a:A        b:A        a:B        b:B
#alpha     1.4750325 -2.7684583 -1.6393289 -1.8513669
#beta     -0.2120782  0.9861509  0.7993313  0.4613999
#beta.se   0.0000000  0.2098876  0.4946167  0.0000000
#r2        1.0000000  0.9566642  0.7231188  1.0000000
#df.resid  0.0000000  1.0000000  1.0000000  0.0000000
#
#$y2
#                a:A         b:A        a:B        b:B
#alpha     1.0292956 -0.22746944 -1.5096975 0.06876360
#beta     -0.2657021 -0.20650690  0.2547738 0.09172993
#beta.se   0.0000000  0.01945569  0.3483856 0.00000000
#r2        1.0000000  0.99120195  0.3484482 1.00000000
#df.resid  0.0000000  1.00000000  1.0000000 0.00000000

fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)

## note that the first group a:A only has two values, so df.resid = 0
## my method returns 0 standard error for the slope
## but lm or lmList would return NaN
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.4750325        NaN     NaN      NaN
#x           -0.2120782        NaN     NaN      NaN
#
#$`Response y1`$r.squared
#[1] 1
#
#
#$`Response y2`
#$`Response y2`$coefficients
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.0292956        NaN     NaN      NaN
#x           -0.2657021        NaN     NaN      NaN
#
#$`Response y2`$r.squared
#[1] 1

不分组:lm

no grouping: a fast method of lm

group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
#$y1
#                ALL
#alpha    -0.9481826
#beta      0.4357022
#beta.se   0.2408162
#r2        0.2903691
#df.resid  8.0000000
#
#$y2
#                ALL
#alpha     0.1002184
#beta     -0.1542332
#beta.se   0.1514935
#r2        0.1147012
#df.resid  8.0000000


快速的大型简单线性回归

set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr), 
                 y2 = rpois(nr, 3), 
                 y3 = rnorm(nr), 
                 y4 = rnorm(nr, 1, 5))

system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
#   user  system elapsed 
#  0.200   0.016   0.219 

library(MatrixModels)
system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
#   user  system elapsed 
#  9.012   0.172   9.266 

group_by_simpleLM几乎立即执行所有4个响应,而glm4仅一个响应就需要9s!

group_by_simpleLM does all 4 responses almost instantly, while glm4 needs 9s for one response alone!

请注意,glm4在排名不足的情况下可能会崩溃,而group_by_simpleLM则不会.

Note that glm4 can break down in rank-deficient case, while group_by_simpleLM would not.

附录:"group_by_simpleLM.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {

  /* number of data and number of responses */
  int n = x.size(), k = Y.size(), n_groups = group_levels.size();

  /* set up result list */
  List result(k);
  List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
  int j; for (j = 0; j < k; j++) {
    NumericMatrix mat(5, n_groups);
    mat.attr("dimnames") = dimnames;
    result[j] = mat;
    }
  result.attr("names") = Y.attr("names");

  /* set up a vector to hold sample size for each group */
  size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));

  /*
    compute group offset: cumsum(group_offset)
    The offset is used in a different way when group is sorted or unsorted
    In the former case, it is the offset to real x, y values;
    In the latter case, it is the offset to ordering index indx
 */
  int *u = INTEGER(group), *u_end = u + n, i;
  if (n_groups > 1) {
    while (u < u_end) group_offset[*u++]++;
    for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
    } else {
    group_offset[1] = n;
    group_unsorted = 0;
    }

  /* local variables & pointers */
  double *xi, *xi_end;    /* pointer to the 1st and the last x value */
  double *yi;             /* pointer to the first y value */
  int gi; double inv_gi;  /* sample size of the i-th group */
  double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
  double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
  double xiyi_cov;        /* covariance between x and y values in the i-th group */
  double beta, r2; int df_resi;
  double *matij;

  /* additional storage and variables when group is unsorted */
  int *indx; double *xb, *xbi, dtmp;
  if (group_unsorted) {
    indx = (int *)malloc(n * sizeof(int));
    xb = (double *)malloc(n * sizeof(double));  // buffer x for caching
    R_orderVector1(indx, n, group, TRUE, FALSE);  // Er, how is TRUE & FALSE recogonized as Rboolean?
    }

  /* loop through groups */
  for (i = 0; i < n_groups; i++) {
    /* set group size gi */
    gi = group_offset[i + 1] - group_offset[i];
    /* special case for a factor level with no data */
    if (gi == 0) {
      for (j = 0; j < k; j++) {
        /* matrix column for write-back */
        matij = REAL(result[j]) + i * 5;
        matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
        matij[3] = R_NaN; matij[4] = R_NaN;
        }
      continue;
      }
    /* rank-deficient case */
    if (gi == 1) {
      gi = group_offset[i];
      if (group_unsorted) gi = indx[gi];
      for (j = 0; j < k; j++) {
        /* matrix column for write-back */
        matij = REAL(result[j]) + i * 5;
        matij[0] = REAL(Y[j])[gi];
        matij[1] = R_NaN; matij[2] = R_NaN;
        matij[3] = R_NaN; matij[4] = 0.0;
        }
      continue;
      }
    /* general case where a regression line can be estimated */
    inv_gi = 1 / (double)gi;
    /* compute mean & variance of x values in this group */
    xi_mean = 0.0; xi_var = 0.0;
    if (group_unsorted) {
      /* use u, u_end and xbi */
      xi = REAL(x);
      u = indx + group_offset[i];  /* offset acts on index */
      u_end = u + gi;
      xbi = xb + group_offset[i];
      for (; u < u_end; xbi++, u++) {
        dtmp = xi[*u];
        xi_mean += dtmp;
        xi_var += dtmp * dtmp;
        *xbi = dtmp;
        }
      } else {
      /* use xi and xi_end */
      xi = REAL(x) + group_offset[i];  /* offset acts on values */
      xi_end = xi + gi;
      for (; xi < xi_end; xi++) {
        xi_mean += *xi;
        xi_var += (*xi) * (*xi);
        }
      }
    xi_mean = xi_mean * inv_gi;
    xi_var = xi_var * inv_gi - xi_mean * xi_mean;
    /* loop through responses doing simple linear regression */
    for (j = 0; j < k; j++) {
      /* compute mean & variance of y values, as well its covariance with x values */
      yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
      if (group_unsorted) {
        xbi = xb + group_offset[i];  /* use buffered x values */
        yi = REAL(Y[j]);
        u = indx + group_offset[i];  /* offset acts on index */
        for (; u < u_end; u++, xbi++) {
          dtmp = yi[*u];
          yi_mean += dtmp;
          yi_var += dtmp * dtmp;
          xiyi_cov += dtmp * (*xbi);
          } 
        } else {
        /* set xi and yi */
        xi = REAL(x) + group_offset[i];  /* offset acts on values */
        yi = REAL(Y[j]) + group_offset[i];  /* offset acts on values */
        for (; xi < xi_end; xi++, yi++) {
          yi_mean += *yi;
          yi_var += (*yi) * (*yi);
          xiyi_cov += (*yi) * (*xi);
          }
        }
      yi_mean = yi_mean * inv_gi;
      yi_var = yi_var * inv_gi - yi_mean * yi_mean;
      xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
      /* locate the right place to write back regression result */
      matij = REAL(result[j]) + i * 5 + 4;
      /* residual degree of freedom */
      df_resi = gi - 2; *matij-- = (double)df_resi;
      /* R-squared = squared correlation */
      r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
      /* standard error of regression slope */
      if (df_resi == 0) *matij-- = 0.0;
      else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
      /* regression slope */
      beta = xiyi_cov / xi_var; *matij-- = beta;
      /* regression intercept */
      *matij = yi_mean - beta * xi_mean;
      }
    }

  if (group_unsorted) {
    free(indx);
    free(xb);
    }
  free(group_offset);
  return result;
  }

/*** R
group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {

  ## basic input validation
  if (missing(dat)) stop("no data provided to 'dat'!")
  if (!is.data.frame(dat)) stop("'dat' must be a data frame!")

  if (missing(LHS)) stop("no 'LHS' provided!")
  if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")

  if (missing(RHS)) stop("no 'RHS' provided!")
  if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")

  if (!is.null(group)) {

    ## grouping variable provided: a fast method of `nlme::lmList`

    if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")

    ## ensure that group has length 1, is available in the data frame and is a factor
    if (length(group) > 1L) {
      warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
      group <- group[1L]
      }
    grp <- dat[[group]]
    if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))

    if (is.factor(grp)) {
      grp_levels <- levels(grp)
      } else {
      warning("grouping variable is not provided as a factor; fast coercion is made!")
      grp_levels <- unique(grp)
      grp <- match(grp, grp_levels)
      grp_levels <- as.character(grp_levels)
      }

    grp_unsorted <- .Internal(is.unsorted(grp, FALSE))

    } else {

    ## no grouping; a fast method of `lm`
    grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE

    }

  ## the RHS must has length 1, is available in the data frame and is numeric
  if (length(RHS) > 1L) {
    warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
    RHS <- RHS[1L]
    }
  x <- dat[[RHS]]
  if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
  if (!is.numeric(x) || is.factor(x)) {
    stop("RHS variable must be 'numeric' for simple linear regression!")
    }
  x < as.numeric(x)  ## just in case that `x` is an integer

  ## check LHS variables
  nested <- match(RHS, LHS, nomatch = 0L)
  if (nested > 0L) {
    warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
    LHS <- LHS[-nested]
    }
  if (length(LHS) == 0L) stop("no usable LHS variables found!")
  missed <- !(LHS %in% names(dat))
  if (any(missed)) {
    warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
    LHS <- LHS[!missed]
    }
  if (length(LHS) == 0L) stop("no usable LHS variables found!")
  Y <- dat[LHS]
  invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
  if (any(invalid_LHS)) {
    warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
    Y <- Y[!invalid_LHS]
    }
  if (length(Y) == 0L) stop("no usable LHS variables found!")
  Y <- lapply(Y, as.numeric)  ## just in case that we have integer variables in Y

  ## check for exsitence of NA, NaN, Inf and -Inf and drop them?

  ## use Rcpp
  group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
  }
*/

这篇关于快速分组简单线性回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆