数据框中变量之间的快速成对简单线性回归 [英] Fast pairwise simple linear regression between variables in a data frame

查看:85
本文介绍了数据框中变量之间的快速成对简单线性回归的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Stack Overflow上看到了成对或一般成对的简单线性回归.这是解决此类问题的玩具数据集.

I have seen pairwise or general paired simple linear regression many times on Stack Overflow. Here is a toy dataset for this kind of problem.

set.seed(0)
X <- matrix(runif(100), 100, 5, dimnames = list(1:100, LETTERS[1:5]))
b <- c(1, 0.7, 1.3, 2.9, -2)
dat <- X * b[col(X)] + matrix(rnorm(100 * 5, 0, 0.1), 100, 5)
dat <- as.data.frame(dat)
pairs(dat)

所以基本上我们想计算5 * 4 = 20条回归线:

So basically we want to compute 5 * 4 = 20 regression lines:

-----  A ~ B  A ~ C  A ~ D  A ~ E
B ~ A  -----  B ~ C  B ~ D  B ~ E
C ~ A  C ~ B  -----  C ~ D  C ~ E
D ~ A  D ~ B  D ~ C  -----  D ~ E
E ~ A  E ~ B  E ~ C  E ~ D  -----

这是穷人的策略:

poor <- function (dat) {
  n <- nrow(dat)
  p <- ncol(dat)
  ## all formulae
  LHS <- rep(colnames(dat), p)
  RHS <- rep(colnames(dat), each = p)
  ## function to fit and summarize a single model
  fitmodel <- function (LHS, RHS) {
    if (RHS == LHS) {
      z <- data.frame("LHS" = LHS, "RHS" = RHS,
                      "alpha" = 0,
                      "beta" = 1,
                      "beta.se" = 0,
                      "beta.tv" = Inf,
                      "beta.pv" = 0,
                      "sig" = 0,
                      "R2" = 1,
                      "F.fv" = Inf,
                      "F.pv" = 0,
                      stringsAsFactors = FALSE)
      } else {
      result <- summary(lm(reformulate(RHS, LHS), data = dat))
      z <- data.frame("LHS" = LHS, "RHS" = RHS,
                      "alpha" = result$coefficients[1, 1],
                      "beta" = result$coefficients[2, 1],
                      "beta.se" = result$coefficients[2, 2],
                      "beta.tv" = result$coefficients[2, 3],
                      "beta.pv" = result$coefficients[2, 4],
                      "sig" = result$sigma,
                      "R2" = result$r.squared,
                      "F.fv" = result$fstatistic[[1]],
                      "F.pv" = pf(result$fstatistic[[1]], 1, n - 2, lower.tail = FALSE),
                      stringsAsFactors = FALSE)
        }
      z
      }
  ## loop through all models
  do.call("rbind.data.frame", c(Map(fitmodel, LHS, RHS),
                                list(make.row.names = FALSE,
                                     stringsAsFactors = FALSE)))
  }

逻辑很明确:获取所有对,构造模型公式(reformulate),拟合回归(lm),进行汇总summary,返回所有统计数据,并rbind将它们作为数据框架.

The logic is clear: get all pairs, construct the model formula (reformulate), fit a regression (lm), do a summary summary, return all statistics and rbind them to be a data frame.

好的,但是如果有p个变量怎么办?然后,我们需要进行p * (p - 1)回归!

OK, fine, but what if there are p variables? We then need to do p * (p - 1) regressions!

我能想到的直接改进是拟合具有多个LHS的线性模型.例如,该公式矩阵的第一列被合并到

An immediate improvement I could think of, is Fitting a linear model with multiple LHS. For example, the first column of that formula matrix is merged to

cbind(B, C, D, E) ~ A

这会将回归次数从p * (p - 1)减少到p.

This reduces the number of regression from p * (p - 1) to p.

但是,如果不使用lmsummary,我们绝对可以做得更好.这是我以前的尝试:是否可以快速估算简单回归(仅具有截距和斜率的回归线)?.它之所以快速,是因为它使用变量之间的协方差进行估计,例如求解正态方程.但是其中的simpleLM功能非常有限:

But we can definitely do even better without using lm and summary. Here is my previous attempt: Is there a fast estimation of simple regression (a regression line with only intercept and slope)?. It is fast because it uses covariance between variables for estimation, like solving the normal equation. But the simpleLM function there is pretty limited:

  1. 需要计算残差矢量以估计残差标准误差,这是性能瓶颈;
  2. 它不支持多个LHS,因此在成对回归设置中需要被称为p * (p - 1)次.

我们可以通过编写函数pairwise_simpleLM将其概括为快速成对回归吗?

Can we generalize it for fast pairwise regression, by writing a function pairwise_simpleLM?

上述成对回归的一个更有用的变化是一组LHS变量和一组RHS变量之间的一般成对回归.

A more useful variation of the above pairwise regression is the general paired regression between a set of LHS variables and a set of RHS variables.

示例1

LHS变量ABC和RHS变量DE之间的拟合配对回归,即拟合6条简单线性回归线:

Fit paired regression between LHS variables A, B, C and RHS variables D, E, that is, fit 6 simple linear regression lines:

A ~ D  A ~ E
B ~ D  B ~ E
C ~ D  C ~ E

示例2

将具有多个LHS变量的简单线性回归拟合到特定的RHS变量,例如:cbind(A, B, C, D) ~ E.

Fit a simple linear regression with multiple LHS variables to a particular RHS variable, say: cbind(A, B, C, D) ~ E.

示例3

使用一个特定的LHS变量和一组RHS变量一次进行简单的线性回归,例如:

Fit a simple linear regression with a particular LHS variable, and a set of RHS variables one at a time, for example:

A ~ B  A ~ C  A ~ D  A ~ E 

我们还能为此提供快速功能吗general_paired_simpleLM?

Can we also have a fast function general_paired_simpleLM for this?

警告

  1. 所有变量必须为数字;因素是不允许的,或者成对回归是没有意义的.
  2. 没有讨论加权回归,因为在这种情况下没有方差-协方差方法.

推荐答案

一些统计结果/背景

(图片中的链接:用于计算R2中的R2(R平方)的函数)

这里涉及的计算基本上是方差-协方差矩阵的计算.一旦有了它,所有成对回归的结果就是逐元素矩阵算术.

Computations involved here is basically the computation of the variance-covariance matrix. Once we have it, results for all pairwise regression is just element-wise matrix arithmetic.

方差-协方差矩阵可以通过R函数cov获得,但下面的函数可以使用crossprod手动计算.好处是,如果有的话,显然可以从优化的BLAS库中受益.请注意,以这种方式进行了大量简化. R函数cov具有参数use,该参数允许处理NA,但crossprod不允许.我假设您的dat根本没有缺失值!如果确实缺少值,请使用na.omit(dat)自己删除它们.

The variance-covariance matrix can be obtained by R function cov, but functions below compute it manually using crossprod. The advantage is that it can obviously benefit from an optimized BLAS library if you have it. Be aware that significant amount of simplification is made in this way. R function cov has argument use which allows handling NA, but crossprod does not. I am assuming that your dat has no missing values at all! If you do have missing values, remove them yourself with na.omit(dat).

将数据帧转换为矩阵的初始as.matrix可能是开销.原则上,如果我使用C/C ++编写所有代码,则可以消除这种强制性.实际上,许多基于元素的矩阵矩阵算法都可以合并到单个循环嵌套中.但是,我现在真的很烦(因为我没有时间).

The initial as.matrix that converts a data frame to a matrix might be an overhead. In principle if I code everything up in C / C++, I can eliminate this coercion. And in fact, many element-wise matrix matrix arithmetic can be merged into a single loop-nest. However, I really bother doing this at the moment (as I have no time).

有些人可能会认为最终回报的格式不方便.可能还有其他格式:

Some people may argue that the format of the final return is inconvenient. There could be other format:

  1. 数据帧列表,每个数据帧给出特定LHS变量的回归结果;
  2. 数据帧列表,每个数据帧都给出特定RHS变量的回归结果.

这实际上是基于意见的.无论如何,您始终可以在我返回给您的数据框中自己通过"LHS"列或"RHS"列进行split.data.frame.

This is really opinion-based. Anyway, you can always do a split.data.frame by "LHS" column or "RHS" column yourself on the data frame I return you.

pairwise_simpleLM <- function (dat) {
  ## matrix and its dimension (n: numbeta.ser of data; p: numbeta.ser of variables)
  dat <- as.matrix(dat)
  n <- nrow(dat)
  p <- ncol(dat)
  ## variable summary: mean, (unscaled) covariance and (unscaled) variance
  m <- colMeans(dat)
  V <- crossprod(dat) - tcrossprod(m * sqrt(n))
  d <- diag(V)
  ## R-squared (explained variance) and its complement
  R2 <- (V ^ 2) * tcrossprod(1 / d)
  R2_complement <- 1 - R2
  R2_complement[seq.int(from = 1, by = p + 1, length = p)] <- 0
  ## slope and intercept
  beta <- V * rep(1 / d, each = p)
  alpha <- m - beta * rep(m, each = p)
  ## residual sum of squares and standard error
  RSS <- R2_complement * d
  sig <- sqrt(RSS * (1 / (n - 2)))
  ## statistics for slope
  beta.se <- sig * rep(1 / sqrt(d), each = p)
  beta.tv <- beta / beta.se
  beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
  ## F-statistic and p-value
  F.fv <- (n - 2) * R2 / R2_complement
  F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
  ## export
  data.frame(LHS = rep(colnames(dat), times = p),
             RHS = rep(colnames(dat), each = p),
             alpha = c(alpha),
             beta = c(beta),
             beta.se = c(beta.se),
             beta.tv = c(beta.tv),
             beta.pv = c(beta.pv),
             sig = c(sig),
             R2 = c(R2),
             F.fv = c(F.fv),
             F.pv = c(F.pv),
             stringsAsFactors = FALSE)
  }

让我们在问题中的玩具数据集上比较结果.

Let's compare the result on the toy dataset in the question.

oo <- poor(dat)
rr <- pairwise_simpleLM(dat)
all.equal(oo, rr)
#[1] TRUE

让我们看看它的输出:

rr[1:3, ]
#  LHS RHS      alpha      beta    beta.se  beta.tv      beta.pv       sig
#1   A   A 0.00000000 1.0000000 0.00000000      Inf 0.000000e+00 0.0000000
#2   B   A 0.05550367 0.6206434 0.04456744 13.92594 5.796437e-25 0.1252402
#3   C   A 0.05809455 1.2215173 0.04790027 25.50126 4.731618e-45 0.1346059
#         R2     F.fv         F.pv
#1 1.0000000      Inf 0.000000e+00
#2 0.6643051 193.9317 5.796437e-25
#3 0.8690390 650.3142 4.731618e-45

当我们的LHS和RHS相同时,回归是没有意义的,因此拦截为0,斜率为1,依此类推.

When we have the same LHS and RHS, regression is meaningless hence intercept is 0, slope is 1, etc.

速度如何?仍使用此玩具示例:

What about speed? Still using this toy example:

library(microbenchmark)
microbenchmark("poor_man's" = poor(dat), "fast" = pairwise_simpleLM(dat))
#Unit: milliseconds
#       expr        min         lq       mean     median         uq       max
# poor_man's 127.270928 129.060515 137.813875 133.390722 139.029912 216.24995
#       fast   2.732184   3.025217   3.381613   3.134832   3.313079  10.48108

随着我们拥有更多变量,差距将越来越大.例如,我们有10个变量:

The gap is going be increasingly wider as we have more variables. For example, with 10 variables we have:

set.seed(0)
X <- matrix(runif(100), 100, 10, dimnames = list(1:100, LETTERS[1:10]))
b <- runif(10)
DAT <- X * b[col(X)] + matrix(rnorm(100 * 10, 0, 0.1), 100, 10)
DAT <- as.data.frame(DAT)
microbenchmark("poor_man's" = poor(DAT), "fast" = pairwise_simpleLM(DAT))
#Unit: milliseconds
#       expr        min         lq       mean     median        uq        max
# poor_man's 548.949161 551.746631 573.009665 556.307448 564.28355 801.645501
#       fast   3.365772   3.578448   3.721131   3.621229   3.77749   6.791786


R函数general_paired_simpleLM


R function general_paired_simpleLM

general_paired_simpleLM <- function (dat_LHS, dat_RHS) {
  ## matrix and its dimension (n: numbeta.ser of data; p: numbeta.ser of variables)
  dat_LHS <- as.matrix(dat_LHS)
  dat_RHS <- as.matrix(dat_RHS)
  if (nrow(dat_LHS) != nrow(dat_RHS)) stop("'dat_LHS' and 'dat_RHS' don't have same number of rows!")
  n <- nrow(dat_LHS)
  pl <- ncol(dat_LHS)
  pr <- ncol(dat_RHS)
  ## variable summary: mean, (unscaled) covariance and (unscaled) variance
  ml <- colMeans(dat_LHS)
  mr <- colMeans(dat_RHS)
  vl <- colSums(dat_LHS ^ 2) - ml * ml * n
  vr <- colSums(dat_RHS ^ 2) - mr * mr * n
  ##V <- crossprod(dat - rep(m, each = n))  ## cov(u, v) = E[(u - E[u])(v - E[v])]
  V <- crossprod(dat_LHS, dat_RHS) - tcrossprod(ml * sqrt(n), mr * sqrt(n))  ## cov(u, v) = E[uv] - E{u]E[v]
  ## R-squared (explained variance) and its complement
  R2 <- (V ^ 2) * tcrossprod(1 / vl, 1 / vr)
  R2_complement <- 1 - R2
  ## slope and intercept
  beta <- V * rep(1 / vr, each = pl)
  alpha <- ml - beta * rep(mr, each = pl)
  ## residual sum of squares and standard error
  RSS <- R2_complement * vl
  sig <- sqrt(RSS * (1 / (n - 2)))
  ## statistics for slope
  beta.se <- sig * rep(1 / sqrt(vr), each = pl)
  beta.tv <- beta / beta.se
  beta.pv <- 2 * pt(abs(beta.tv), n - 2, lower.tail = FALSE)
  ## F-statistic and p-value
  F.fv <- (n - 2) * R2 / R2_complement
  F.pv <- pf(F.fv, 1, n - 2, lower.tail = FALSE)
  ## export
  data.frame(LHS = rep(colnames(dat_LHS), times = pr),
             RHS = rep(colnames(dat_RHS), each = pl),
             alpha = c(alpha),
             beta = c(beta),
             beta.se = c(beta.se),
             beta.tv = c(beta.tv),
             beta.pv = c(beta.pv),
             sig = c(sig),
             R2 = c(R2),
             F.fv = c(F.fv),
             F.pv = c(F.pv),
             stringsAsFactors = FALSE)
  }

将此应用于问题的示例1 .

general_paired_simpleLM(dat[1:3], dat[4:5])
#  LHS RHS        alpha       beta    beta.se   beta.tv      beta.pv        sig
#1   A   D -0.009212582  0.3450939 0.01171768  29.45071 1.772671e-50 0.09044509
#2   B   D  0.012474593  0.2389177 0.01420516  16.81908 1.201421e-30 0.10964516
#3   C   D -0.005958236  0.4565443 0.01397619  32.66585 1.749650e-54 0.10787785
#4   A   E  0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#5   B   E  0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.10581331
#6   C   E  0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.11849472
#         R2      F.fv         F.pv
#1 0.8984818  867.3441 1.772671e-50
#2 0.7427021  282.8815 1.201421e-30
#3 0.9158840 1067.0579 1.749650e-54
#4 0.8590604  597.3333 1.738263e-43
#5 0.7603718  310.9670 3.636655e-32
#6 0.8985126  867.6375 1.746439e-50

将此应用到问题中的示例2 .

general_paired_simpleLM(dat[1:4], dat[5])
#  LHS RHS       alpha       beta    beta.se   beta.tv      beta.pv       sig
#1   A   E 0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.1065687
#2   B   E 0.012738403 -0.3437776 0.01949488 -17.63426 3.636655e-32 0.1058133
#3   C   E 0.009068106 -0.6430553 0.02183128 -29.45569 1.746439e-50 0.1184947
#4   D   E 0.066190196 -1.3767586 0.03597657 -38.26820 9.828853e-61 0.1952718
#         R2      F.fv         F.pv
#1 0.8590604  597.3333 1.738263e-43
#2 0.7603718  310.9670 3.636655e-32
#3 0.8985126  867.6375 1.746439e-50
#4 0.9372782 1464.4551 9.828853e-61

将此应用于问题中的示例3 .

general_paired_simpleLM(dat[1], dat[2:5])
#  LHS RHS        alpha       beta    beta.se   beta.tv      beta.pv        sig
#1   A   B  0.112229318  1.0703491 0.07686011  13.92594 5.796437e-25 0.16446951
#2   A   C  0.025628210  0.7114422 0.02789832  25.50126 4.731618e-45 0.10272687
#3   A   D -0.009212582  0.3450939 0.01171768  29.45071 1.772671e-50 0.09044509
#4   A   E  0.008650812 -0.4798639 0.01963404 -24.44040 1.738263e-43 0.10656866
#         R2     F.fv         F.pv
#1 0.6643051 193.9317 5.796437e-25
#2 0.8690390 650.3142 4.731618e-45
#3 0.8984818 867.3441 1.772671e-50
#4 0.8590604 597.3333 1.738263e-43

我们甚至可以在两个变量之间进行简单的线性回归:

We can even just do a simple linear regression between two variables:

general_paired_simpleLM(dat[1], dat[2])
#  LHS RHS     alpha     beta    beta.se  beta.tv      beta.pv       sig
#1   A   B 0.1122293 1.070349 0.07686011 13.92594 5.796437e-25 0.1644695
#         R2     F.fv         F.pv
#1 0.6643051 193.9317 5.796437e-25

这意味着其中的simpleLM功能现在已过时.

This means that the simpleLM function in is now obsolete.

附录:图片需要Markdown(需要MathJax支持)

Denote our variables by $x_1$, $x_2$, etc, a pairwise simple linear regression takes the form $$x_i = \alpha_{ij} + \beta_{ij}x_j$$ where $\alpha_{ij}$ and $\beta_{ij}$ is the intercept and the slope of $x_i \sim x_j$, respectively. We also denote $m_i$ and $v_i$ as the sample mean and **unscaled** sample variance of $x_i$. Here, the unscaled variance is just the sum of squares without dividing by sample size, that is $v_i = \sum_{k = 1}^n(x_{ik} - m_i)^2 = (\sum_{k = 1}^nx_{ik}^2) - n m_i^2$. We also denote $V_{ij}$ as the **unscaled** covariance between $x_i$ and $x_j$: $V_{ij} = \sum_{k = 1}^n(x_{ik} - m_i)(x_{jk} - m_j)$ = $(\sum_{k = 1}^nx_{ik}x_{jk}) - nm_im_j$.

Using the results for a simple linear regression given in [Function to calculate R2 (R-squared) in R](https://stackoverflow.com/a/40901487/4891738), we have $$\beta_{ij} = V_{ij} \ / \ v_j,\quad \alpha_{ij} = m_i - \beta_{ij}m_j,\quad r_{ij}^2 = V_{ij}^2 \ / \ (v_iv_j),$$ where $r_{ij}^2$ is the R-squared. Knowing $r_{ij}^2 = RSS_{ij} \ / \ TSS_{ij}$ where $RSS_{ij}$ and $TSS_{ij} = v_i$ are residual sum of squares and total sum of squares of $x_i \sim x_j$, we can derive $RSS_{ij}$ and residual standard error $\sigma_{ij}$ **without actually computing residuals**: $$RSS_{ij} = (1 - r_{ij}^2)v_i,\quad \sigma_{ij} = \sqrt{RSS_{ij} \ / \ (n - 2)}.$$

F-statistic $F_{ij}$ and associated p-value $p_{ij}^F$ can also be obtained from sum of squares: $$F_{ij} = \tfrac{(TSS_{ij} - RSS_{ij}) \ / \ 1}{RSS_{ij} \ / \ (n - 2)} = (n - 2) r_{ij}^2 \ / \ (1 - r_{ij}^2),\quad p_{ij}^F = 1 - \texttt{CDF_F}(F_{ij};\ 1,\ n - 2),$$ where $\texttt{CDF_F}$ denotes the CDF of F-distribution.

The only thing left is the standard error $e_{ij}$, t-statistic $t_{ij}$ and associated p-value $p_{ij}^t$ for $\beta_{ij}$, which are $$e_{ij} = \sigma_{ij} \ / \ \sqrt{v_i},\quad t_{ij} = \beta_{ij} \ / \ e_{ij},\quad p_{ij}^t = 2 * \texttt{CDF_t}(-|t_{ij}|; \ n - 2),$$ where $\texttt{CDF_t}$ denotes the CDF of t-distribution.

这篇关于数据框中变量之间的快速成对简单线性回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆