Fastest way to calculate many regressions in python?


Question

I think I have a pretty reasonable idea of how to go about accomplishing this, but I'm not 100% sure on all of the steps. This question is mostly intended as a sanity check to ensure that I'm doing this in the most efficient way, and that my math is actually sound (since my statistics knowledge is not completely perfect).

Anyway, here is some explanation of what I'm trying to do:

I have a lot of time series data that I would like to perform some linear regressions on. In particular, I have roughly 2000 observations on 500 different variables. For each variable, I need to perform a regression using two explanatory variables (two additional vectors of roughly 2000 observations). So for each of 500 different Y's, I would need to find a and b in the following regression Y = aX_1 + bX_2 + e.

Up until this point, I have been using the OLS function in the statsmodels package to perform my regressions. However, as far as I can tell, if I wanted to use the statsmodels package to accomplish my problem, I would have to call it hundreds of times, which just seems generally inefficient.

So instead, I decided to revisit some statistics that I haven't really touched in a long time. If my knowledge is still correct, I can put all of my observations into one large Y matrix that is roughly 2000 x 500. I can then stick my explanatory variables into an X matrix that is roughly 2000 x 2, and get the results of all 500 of my regressions by calculating (X'Y)/(X'X). If I do this using basic numpy stuff (matrix multiplication using * and inverses using matrix.I), I'm guessing it will be much faster than doing hundreds of statsmodel OLS calls.
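As a rough sketch of that batched formulation (synthetic data with the shapes from the question; np.linalg.lstsq is used here instead of an explicit inverse, since it is more numerically stable than forming (X'X)^-1 directly):

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_series = 2000, 500

# Two shared explanatory variables as a 2000 x 2 design matrix.
X = rng.standard_normal((n_obs, 2))
true_coefs = rng.standard_normal((2, n_series))
Y = X @ true_coefs + 0.1 * rng.standard_normal((n_obs, n_series))

# One lstsq call solves all 500 regressions at once; coefs is 2 x 500,
# one (a, b) column per Y column.
coefs, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
```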

Here are my questions:

  • Is the numpy approach faster than my previous method of calling statsmodels many times? If so, is it the fastest/most efficient way to accomplish what I want? I'm assuming it is, but if you know of a better way, I'd be happy to hear it. (Surely I'm not the first person to need to compute many regressions this way.)
  • How do I deal with missing data in my matrices? My time series data isn't going to be nice and complete, and there will be missing values here and there. If I just try to do regular matrix multiplication in numpy, the NA values will propagate and I'll end up with a matrix of mostly NAs. If I perform each regression independently, I can just drop the rows containing NAs before performing the regression, but if I do that on the big 2000 x 500 matrix I'll end up dropping actual non-NA data from some of the other variables, which I obviously don't want.
  • What is the most efficient way to make sure my time series data is properly aligned before I put it into the matrices in the first place? The start and end dates of my observations aren't necessarily the same, and some series may have days that others don't. If I were to pick a method for doing this, I would put all of the observations into a pandas DataFrame indexed by date. Pandas would then end up doing all of the work of aligning everything for me, and I could extract the underlying ndarray when done. Is that the best way, or does pandas have some kind of overhead that I could avoid by doing the matrix construction in a different way?
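For the last question, a minimal sketch of the pandas alignment approach described there (the toy series and dates are assumptions for illustration):

```python
import pandas as pd

# Two toy series with different, overlapping date ranges.
a = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2020-01-01", periods=3))
b = pd.Series([10.0, 20.0, 30.0], index=pd.date_range("2020-01-02", periods=3))

# Building one DataFrame aligns the series on the union of their dates,
# inserting NaN wherever a series has no observation.
df = pd.DataFrame({"a": a, "b": b})   # 4 dates x 2 columns

# The underlying ndarray can then be extracted for the numpy work.
arr = df.to_numpy()
```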

Answer

Some quick answers:

1) Calling statsmodels repeatedly is not the fastest way. If we only need parameters, predictions and residuals, and the explanatory variables are identical across regressions, then I usually just use params = pinv(x).dot(y), where y is two-dimensional, and calculate the rest from there. The problem is that inference, confidence intervals and the like require extra work, so unless speed is crucial and only a restricted set of results is needed, statsmodels OLS is still more convenient.

This only works if all y and x share the same observation indices, with no missing values and no gaps.
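A minimal sketch of that pinv approach under those assumptions (synthetic data, shapes matching the question):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 2))
Y = X @ rng.standard_normal((2, 500)) + 0.1 * rng.standard_normal((2000, 500))

# pinv(X) is 2 x 2000, so a single product yields all 500 (a, b) pairs.
params = np.linalg.pinv(X).dot(Y)   # shape (2, 500)

# Fitted values and residuals follow directly.
fitted = X.dot(params)              # shape (2000, 500)
resid = Y - fitted
```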

Aside: this setup is a multivariate linear model, which will hopefully be supported by statsmodels in the not-too-distant future.

2) and 3) The fast, simple linear algebra of case 1) does not work if there are missing cells or if the observations (indices) do not fully overlap. In the analogy to panel data, the first case requires a "balanced" panel, while the others imply "unbalanced" data. The standard way is to stack the data with the explanatory variables in block-diagonal form. Since this increases memory use by a large amount, using sparse matrices and sparse linear algebra is better. Whether building and solving the sparse problem is faster than looping over individual OLS regressions depends on the specific case.
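A sketch of that block-diagonal stacking with scipy.sparse (tiny synthetic sizes; the per-series masks and the lsqr solver are assumptions for illustration):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(2)
n_obs, n_series = 50, 3   # tiny sizes for illustration

X = rng.standard_normal((n_obs, 2))
Y = X @ rng.standard_normal((2, n_series)) + 0.1 * rng.standard_normal((n_obs, n_series))

# Each series keeps its own set of valid (non-missing) rows.
masks = [rng.random(n_obs) > 0.1 for _ in range(n_series)]

# Stack the per-series design matrices block-diagonally, and the matching
# y observations into one long right-hand side.
A = sp.block_diag([X[m] for m in masks], format="csr")
b = np.concatenate([Y[m, j] for j, m in enumerate(masks)])

# One sparse least-squares solve recovers two coefficients per series.
coefs = lsqr(A, b)[0].reshape(n_series, 2)
```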

Specialized code (just a thought):

In case 2), with not fully overlapping or cellwise missing values, we would still need to calculate all the x'x and x'y matrices, i.e. 500 of each. Given that you only have two regressors, 500 x 2 x 2 still does not require much memory. So it might be possible to calculate the parameters, predictions and residuals by using the non-missing mask as weights in the cross-product calculations. numpy's linalg.inv is vectorized, as far as I know, so I think this could be done with a few vectorized calculations.
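A sketch of that masked cross-product idea (synthetic data; np.linalg.solve is used here rather than inv, since it also broadcasts over the stacked 2 x 2 matrices and avoids forming explicit inverses):

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_series = 2000, 500
X = rng.standard_normal((n_obs, 2))
Y = X @ rng.standard_normal((2, n_series)) + 0.1 * rng.standard_normal((n_obs, n_series))

# Scatter some cellwise missing values into Y.
Y[rng.random(Y.shape) < 0.05] = np.nan

mask = ~np.isnan(Y)                 # (n_obs, n_series) validity weights
Yz = np.where(mask, Y, 0.0)
w = mask.astype(float)

# Weighted cross products: 500 stacked 2x2 x'x matrices and 2-vectors x'y,
# where each series only sums over its own non-missing rows.
xtx = np.einsum("ni,nj,nk->kij", X, X, w, optimize=True)   # (500, 2, 2)
xty = np.einsum("ni,nk->ki", X, Yz, optimize=True)         # (500, 2)

# np.linalg.solve broadcasts over the leading axis, so one call solves
# all 500 normal-equation systems.
params = np.linalg.solve(xtx, xty)  # (500, 2)
```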
