PCA: How does princomp() work and can I use it to pick up variables for ARIMA?

Question

I'm trying to use PCA to pick good predictors to use in the xreg argument of an arima model, in order to forecast the tVar variable below. I am just using the reduced dataset below, with only a few variables, to keep the example simple.

I am trying to understand how the formula argument in princomp works. For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])"?

What I ultimately would like to do is create a new variable which explains most of the variance in tVar. Is that something I can do using PCA? If so, could someone please explain how, or point me towards an example?

Code:

pc <- princomp(~xVar1+xVar2,
               data = na.omit(dfData[,c("tVar","xVar1","xVar2")]), 
               cor=TRUE)

Data:

dput(na.omit(dfData[1:100,c("tVar","xVar1","xVar2")]))
structure(list(tVar = c(11, 14, 17, 5, 5, 5.5, 8, 5.5, 
          6.5, 8.5, 4, 5, 9, 10, 11, 7, 6, 7, 7, 5, 6, 9, 9, 6.5, 9, 3.5, 
          2, 15, 2.5, 17, 5, 5.5, 7, 6, 3.5, 6, 9.5, 5, 7, 4, 5, 4, 9.5, 
          3.5, 5, 4, 4, 9, 4.5, 6, 10, 9.5, 15, 9, 5.5, 7.5, 12, 17.5, 
          19, 7, 14, 17, 3.5, 6, 15, 11, 10.5, 11, 13, 9.5, 9, 7, 4, 6, 
          15, 5, 18, 5, 6, 19, 19, 6, 7, 7.5, 7.5, 7, 6.5, 9, 10, 5.5, 
          5, 7.5, 5, 4, 10, 7, 5, 12), xVar1 = c(0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
          1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
          xVar2  = c(0L, 
          1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 
          2L, 3L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
          0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
          0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 3L, 1L, 0L, 1L, 2L,
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 
          1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 
          0L)), .Names = c("tVar", "xVar1", "xVar2"
          ), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 
          13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,25L, 
          26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,38L, 
          39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,51L, 
          52L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L,
          66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 
          79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 
          92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L),
          class  = "data.frame", na.action = structure(c(8L,53L),
          .Names = c("8", "53"), class = "omit"))

Answer

(This is a very good post! It is interesting to have another post about PCA today. That question is more basic, about the difference between princomp and prcomp, but the mathematical details with R code that I give in that answer may be beneficial to anyone learning PCA.)

PCA is used for dimension reduction (low-rank approximation), when:

  1. you have a lot of (say p) correlated variables x1, x2, ..., xp;
  2. you want to shrink them to a small number (say k < p) of new, linearly independent variables z1, z2, ..., zk;
  3. you want to use z1, z2, ..., zk rather than x1, x2, ..., xp to predict a response variable y.

A basic picture and a little math

Suppose you have a response variable y. A full linear regression, without dropping any variables, would take the formula:

y ~ x1 + x2 + ... + xp

However, we can fit a reasonable approximate model after PCA. Let X be the model matrix above, i.e., the matrix formed by combining all observations of x1, x2, ..., xp column by column. Then:

S <- cor(X)  ## get the correlation matrix S
E <- eigen(S)  ## compute the eigen decomposition of S
root_eigen_value <- sqrt(E$values)  ## square roots of the eigenvalues
eigen_vector_mat <- E$vectors  ## matrix of eigenvectors
X1 <- scale(X) %*% eigen_vector_mat  ## transform the original matrix
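
As a quick sanity check (a small sketch, not part of the original answer): because cov(scale(X)) equals cor(X), the columns of X1 are uncorrelated and their variances equal the eigenvalues of S:

round(cov(X1), 10)  ## (approximately) a diagonal matrix
all.equal(unname(diag(cov(X1))), E$values)  ## variances match the eigenvalues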

Now, root_eigen_value (a length-p vector) is monotonically decreasing, i.e., the contribution to the total covariance is decreasing, so we can keep only the first k values. Accordingly, we can select the first k columns of the transformed matrix X1:

Z <- X1[, 1:k]
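
How large should k be? One common heuristic, continuing from the code above (a sketch; the 90% threshold is my assumption, not part of the original answer), is to keep enough components to cover most of the total variance:

prop_var <- E$values / sum(E$values)    ## proportion of variance per component
k <- which(cumsum(prop_var) >= 0.9)[1]  ## smallest k explaining at least 90%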

Now we have successfully reduced the p variables to k variables, and each column of Z is one of the new variables z1, z2, ..., zk. Bear in mind that these variables are not a subset of the original variables; they are completely new, without names. But since we are only interested in predicting y, it does not matter what names we give to z1, z2, ..., zk. We can then fit an approximate linear model:

y ~ z1 + z2 + ... + zk
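
In R this step can be as simple as the following sketch, assuming y is the response vector aligned with the rows of Z; lm() accepts a matrix on the right-hand side and fits one coefficient per column:

fit <- lm(y ~ Z)  ## regression of y on the k principal-component scores
summary(fit)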

Using princomp()

In fact, things are even easier, because princomp() does all of this computation for us. By calling:

pc <- princomp(~ x1 + x2 + ... + xp, data, cor = TRUE)

we get everything we want. Among the values returned in pc:

  1. pc$sdev gives root_eigen_value. If you do plot(pc), you can see a bar plot of the component variances (see the sketch after this list). If your input data are highly correlated, you should see a near-exponential decay in this figure, with only a few components dominating the covariance. (Unfortunately, your toy data will not show this. xVar1 and xVar2 are binary and already linearly independent, so after PCA you will see that they contribute about equally.)
  2. pc$loadings gives eigen_vector_mat;
  3. pc$scores gives X1.
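
For example, with the question's data, you can inspect these pieces as follows (a sketch; the call mirrors the one in the question, using only the predictor columns):

pc <- princomp(~ xVar1 + xVar2,
               data = na.omit(dfData[, c("xVar1", "xVar2")]),
               cor = TRUE)
pc$sdev               ## root_eigen_value
plot(pc)              ## bar plot of the component variances
unclass(pc$loadings)  ## eigen_vector_mat
head(pc$scores)       ## first rows of X1
summary(pc)           ## proportion of variance explained by each component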

Using arima()

The variable selection process is then simple. If, after inspecting plot(pc), you decide to keep the first k of the p variables, you extract the first k columns of the pc$scores matrix. Each column is one of z1, z2, ..., zk, and you pass them to arima() via the xreg argument.
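
A minimal sketch of that last step, assuming you keep k components and fit an ARIMA(1,0,0); the order here is only a placeholder and would normally be chosen with the usual diagnostics:

dat <- na.omit(dfData[, c("tVar", "xVar1", "xVar2")])  ## keep rows aligned
pc  <- princomp(~ xVar1 + xVar2, data = dat, cor = TRUE)
k   <- 1                                   ## chosen by inspecting plot(pc)
Z   <- pc$scores[, 1:k, drop = FALSE]      ## first k principal components
fit <- arima(dat$tVar, order = c(1, 0, 0), xreg = Z)
fit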

Back to your question about the formula

For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])"?

After my explanation, you should know the answer is "No". Do not mix the response variable tVar, used in the regression step, with the predictor variables xVar1, xVar2, ..., used in the PCA step.

princomp() allows three ways to pass in arguments:

  1. by a formula and a data frame;
  2. by a model matrix;
  3. by a covariance matrix (see the sketch after this list).
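
For reference, the three interfaces look like this (a sketch using the question's predictor columns; note that with only a covariance matrix, princomp() cannot return scores):

X <- na.omit(dfData[, c("xVar1", "xVar2")])
pc1 <- princomp(~ xVar1 + xVar2, data = X, cor = TRUE)  ## 1. formula + data frame
pc2 <- princomp(X, cor = TRUE)                          ## 2. data matrix / data frame
pc3 <- princomp(covmat = cor(X))                        ## 3. covariance / correlation matrix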

You have chosen the first way. The formula tells princomp() which columns to extract from data; it then computes the model matrix, the covariance matrix, the correlation matrix, and the eigen decomposition, until we finally get the PCA result.

Following up on your comment

So if I understand correctly, PCA is primarily for reducing the number of variables, and I shouldn't include the response variable tVar in the formula or data. But I was wondering why princomp(~xVar1+xVar2, data = na.omit(dfData[,c("tVar","xVar1","xVar2")]), cor=TRUE) and princomp(na.omit(dfData[,c("xVar1","xVar2")]), cor=TRUE) are basically equivalent?

The formula tells princomp() how to extract the matrix from the data frame. Since you use the same formula ~ xVar1 + xVar2, it makes no difference whether or not you include tVar in the data frame you pass to princomp, because that column is never touched by princomp.
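
You can check this directly (a small sketch, not in the original answer; it assumes na.omit() drops the same rows in both calls, i.e., the missing values are not confined to tVar):

pcA <- princomp(~ xVar1 + xVar2,
                data = na.omit(dfData[, c("tVar", "xVar1", "xVar2")]),
                cor = TRUE)
pcB <- princomp(na.omit(dfData[, c("xVar1", "xVar2")]), cor = TRUE)
all.equal(pcA$sdev, pcB$sdev)                             ## same component standard deviations
all.equal(unclass(pcA$loadings), unclass(pcB$loadings))   ## same loadings (up to sign)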

Do not include tVar in your formula for PCA. As I said, regression and PCA are different problems and should not be confused with each other.

To be clear, the strategy with PCA isn't to create a new variable which is a combination of xVar1 and xVar2 and explains most of the variance in tVar, but rather to create a new variable which is a combination of xVar1 and xVar2 and explains most of the variance of dfData[,c("xVar1","xVar2")]?

Yes. Regression (or arima() in your setting) is used to set up the relationship between your response tVar and the predictor variables x1, x2, ..., xp or z1, z2, ..., zk. A regression/arima model explains the mean and variance of the response in terms of the predictors.

PCA is a different problem. It only selects a low-rank (fewer-parameter) representation of your original predictor variables xVar1, xVar2, ..., so that you can use fewer variables in your later regression / ARIMA modelling.

Still, you might need to think about whether you should do PCA for your problem.

  1. Do you have a lot of variables, say 10+? In statistical modelling, it is common to reach hundreds of thousands of parameters. Computation can get very slow if we use all of them. PCA is useful in this case to reduce computational complexity, while giving a reasonable representation of the original covariance.
  2. Are your variables highly correlated? If they are already linearly independent of each other, PCA may not drop anything. For example, the toy data xVar1 and xVar2 you gave are essentially linearly independent, so dimension reduction is impossible. You can view the correlation in your data with pairs(mydata). A better visualization may be the corrplot R package; see this answer for examples of how to use it to plot a covariance matrix, as in the sketch after this list.
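
For example (a sketch; mydata stands for your data frame of predictors, and corrplot must be installed first):

mydata <- na.omit(dfData[, c("xVar1", "xVar2")])
pairs(mydata)  ## scatterplot matrix of the predictors

library(corrplot)  ## install.packages("corrplot") if needed
corrplot(cor(mydata), method = "circle")  ## visualise the correlation matrix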
