PCA: How does princomp() work and can I use it to pick up variables for ARIMA?


Problem description

I'm trying to use PCA to pick good predictors to use in the xreg argument of an arima model to try to forecast the tVar variable below. I am just using the reduced dataset below with just a few variables to make the example simple.

I am trying to understand how the formula argument in princomp works. For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])" ?

What I ultimately would like to do is create a new variable which explains most of the variance in tVar. Is that something I can do using PCA? If so, could someone please explain how or point me towards an example?

Code:

pc <- princomp(~xVar1+xVar2,
               data = na.omit(dfData[,c("tVar","xVar1","xVar2")]), 
               cor=TRUE)

Data:

dput(na.omit(dfData[1:100,c("tVar","xVar1","xVar2")]))
structure(list(tVar = c(11, 14, 17, 5, 5, 5.5, 8, 5.5, 
          6.5, 8.5, 4, 5, 9, 10, 11, 7, 6, 7, 7, 5, 6, 9, 9, 6.5, 9, 3.5, 
          2, 15, 2.5, 17, 5, 5.5, 7, 6, 3.5, 6, 9.5, 5, 7, 4, 5, 4, 9.5, 
          3.5, 5, 4, 4, 9, 4.5, 6, 10, 9.5, 15, 9, 5.5, 7.5, 12, 17.5, 
          19, 7, 14, 17, 3.5, 6, 15, 11, 10.5, 11, 13, 9.5, 9, 7, 4, 6, 
          15, 5, 18, 5, 6, 19, 19, 6, 7, 7.5, 7.5, 7, 6.5, 9, 10, 5.5, 
          5, 7.5, 5, 4, 10, 7, 5, 12), xVar1 = c(0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
          1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
          xVar2  = c(0L, 
          1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 
          2L, 3L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 
          0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 
          0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 3L, 1L, 0L, 1L, 2L,
          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 
          1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 
          0L)), .Names = c("tVar", "xVar1", "xVar2"
          ), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 
          13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,25L, 
          26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L,38L, 
          39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L,51L, 
          52L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L,
          66L, 67L, 68L, 69L, 70L, 71L, 72L, 73L, 74L, 75L, 76L, 77L, 78L, 
          79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 
          92L, 93L, 94L, 95L, 96L, 97L, 98L, 99L, 100L),
          class  = "data.frame", na.action = structure(c(8L,53L),
          .Names = c("8", "53"), class = "omit"))

Answer

(This is a very good post! It is interesting to have another post today regarding PCA. Though that question is more basic, about the difference between princomp and prcomp, the mathematical details with R code I give in this answer may be beneficial to anyone learning PCA.)

PCA is used for dimension reduction (low-rank approximation), when:

  1. you have a lot of (say p) correlated variables x1, x2, ..., xp;
  2. you want to shrink them to a small number (say k < p) of new, linearly independent variables z1, z2, ..., zk;
  3. you want to use z1, z2, ..., zk rather than x1, x2, ..., xp to predict a response variable y.


Basic setup and some mathematics

Suppose you have a response variable y. A full linear regression without dropping any variables would take the formula:

y ~ x1 + x2 + ... + xp

However, after PCA we can build a reasonable approximate model. Let X be the model matrix above, i.e., the matrix formed by combining all observations of x1, x2, ..., xp column by column. Then

S <- cor(X)  ## get correlation matrix S
E <- eigen(S)  ## compute eigen decomposition of S
root_eigen_value <- sqrt(E$values)  ## square root of eigen values
eigen_vector_mat <- E$vectors  ## matrix of eigen vectors
X1 <- scale(X) %*% eigen_vector_mat  ## transform original matrix

Now, root_eigen_value (a length-p vector) is monotonically decreasing, i.e., the contribution to the total covariance is decreasing, so we can keep just the first k values. Accordingly, we select the first k columns of the transformed matrix X1:

Z <- X1[, 1:k]  ## keep the first k columns; k is chosen by inspecting root_eigen_value

Now we have successfully reduced p variables to k variables, and each column of Z is one of the new variables z1, z2, ..., zk. Bear in mind that these variables are not a subset of the original variables; they are completely new, without names. But since we are only interested in predicting y, it does not matter what names we give to z1, z2, ..., zk. We can then fit an approximate linear model:

y ~ z1 + z2 + ... + zk
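
As a concrete illustration, here is a minimal end-to-end sketch of the manual pipeline above. It assumes a hypothetical data frame mydata whose first column is the response y and whose remaining columns are the predictors; the choice k <- 2 is purely illustrative.

## assumption: mydata holds the response in column 1, predictors in the rest
X <- as.matrix(mydata[, -1])        ## predictor matrix
E <- eigen(cor(X))                  ## eigen decomposition of the correlation matrix
root_eigen_value <- sqrt(E$values)  ## inspect how fast this decays to choose k
k <- 2                              ## illustrative choice
Z <- (scale(X) %*% E$vectors)[, 1:k, drop = FALSE]  ## the k new variables
fit <- lm(mydata[[1]] ~ Z)          ## approximate regression y ~ z1 + ... + zk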


Use princomp()

In fact, things are easier, because princomp() does all the computation for us. By calling:

pc <- princomp(~ x1 + x2 + ... + xp, data, cor = TRUE)

we can get all we want. Among the returned values in pc:

  1. pc$sdev gives root_eigen_value. If you do plot(pc), you can see a barplot showing this. If your input data are highly correlated, you would expect to see a near-exponential decay in this figure, with only a few variables dominating the covariance. (Unfortunately, your toy data will not work here: xVar1 and xVar2 are binary and already linearly independent, hence after PCA you will see that they each give an equal contribution.)
  2. pc$loadings gives eigen_vector_mat;
  3. pc$scores gives X1 (a short sketch inspecting all three on the toy data follows this list).


Use arima()

The variable selection process is simple. If you decide to keep the first k of the p variables, by inspecting plot(pc), you extract the first k columns of the pc$scores matrix. These columns are z1, z2, ..., zk, and you pass them to arima() via the xreg argument.
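
A minimal sketch of that step, assuming dat and pc are as in the earlier sketch, tVar is the series to forecast, and k components are kept; the ARIMA order below is purely illustrative:

k   <- 2                                   ## illustrative choice
Z   <- pc$scores[, 1:k, drop = FALSE]      ## the k new regressors
fit <- arima(dat$tVar, order = c(1, 0, 0), xreg = Z)
## forecasting later needs future values of the regressors:
## predict(fit, n.ahead = 10, newxreg = Zfuture)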

Back to your question about the formula

For the pc object below, is it saying "use xVar1 and xVar2 to explain the variance in na.omit(dfData[,c("tVar","xVar1","xVar2")])"

After my explanation, you should know the answer is "No". Do not mix the response variable tVar used in the regression step with the predictor variables xVar1, xVar2, ... used in the PCA step.

princomp() allows three ways to pass in arguments:

  1. via a formula and a data frame;
  2. via a model matrix;
  3. via a covariance matrix.

You have chosen the first way. The formula tells princomp() which variables to extract from data; it then computes the model matrix, the covariance/correlation matrix, and the eigen decomposition, until we finally get the result of the PCA.

Following up on your comment

So if I understand correctly, PCA is primarily for reducing the number of variables, and I shouldn't include the response variable tVar in the formula or data. But I was wondering why princomp(~xVar1+xVar2, data = na.omit(dfData[,c("tVar","xVar1","xVar2")]), cor=TRUE) and princomp(na.omit(dfData[,c("xVar1","xVar2")]), cor=TRUE) are basically equivalent?

The formula tells princomp() how to extract the matrix from the data frame. Since you use the same formula ~ xVar1 + xVar2, it makes no difference whether the data frame you pass to princomp also contains tVar, because that column will never be touched by princomp.
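
That equivalence is easy to check directly (a sketch on the question's data; the two calls should produce identical results because tVar is never used):

dat <- na.omit(dfData[, c("tVar", "xVar1", "xVar2")])
pc1 <- princomp(~ xVar1 + xVar2, data = dat, cor = TRUE)
pc2 <- princomp(dat[, c("xVar1", "xVar2")], cor = TRUE)
all.equal(pc1$sdev, pc2$sdev)          ## expected: TRUE
all.equal(pc1$loadings, pc2$loadings)  ## expected: TRUE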

Do not include tVar in your formula for PCA. As I said, regression and PCA are different problems and should not be confused with each other.

To be clear, the strategy with PCA isn't to create a new variable which is a combination of xVar1 and xVar2 and explains most of the variance in tVar, but rather to create a new variable which is a combination of xVar1 and xVar2 and explains most of the variance of dfData[,c("xVar1","xVar2")]?

Yes. Regression (or arima() in your setting) is used to set up the relation between your response tVar and the predictor variables x1, x2, ..., xp or z1, z2, ..., zk. A regression/ARIMA model explains the mean and variance of the response in terms of the predictors.

PCA is a different problem. It only selects a low-rank (fewer-parameter) representation of your original predictor variables xVar1, xVar2, ..., so that you can use fewer variables in later regression / ARIMA modelling.

Still, you might need to think about whether you should do PCA for your problem.

  1. Do you have a lot of variables, say 10+? In statistical modelling, it is common to reach hundreds of thousands of parameters. Computation can get very slow if we use all of them. PCA is useful in this case, to reduce computational complexity while giving a reasonable representation of the original covariance.
  2. Are your variables highly correlated? If they are already linearly independent of each other, PCA may not drop anything. For example, the toy data xVar1 and xVar2 you gave are just linearly independent, so dimension reduction is impossible. You can view the correlations in your data with pairs(mydata). A better visualization may be the corrplot R package; see this answer for examples of how to use it to plot a covariance matrix. (A short sketch follows this list.)
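
For instance, a quick correlation check on the question's data might look like this (a sketch; the corrplot package must be installed separately):

dat <- na.omit(dfData[, c("tVar", "xVar1", "xVar2")])
cor(dat[, c("xVar1", "xVar2")])            ## inspect the off-diagonal entries
pairs(dat[, c("xVar1", "xVar2")])          ## scatterplot matrix of the predictors
library(corrplot)
corrplot(cor(dat[, c("xVar1", "xVar2")]))  ## graphical view of the correlation matrix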
