Regression with big data


Problem Description


I have data on two variables (y, x): 7 years of weekly data (364 weeks) for 80,000 groups. I need to demean the data by group and run a regression of y on x plus 8 dummy variables that need to be created. That is 364 × 80,000 ≈ 29 million rows, or roughly 290 million data points across the ~10 columns. I 'borrowed' an account on a server and found that the regression needs at least 144 GB of memory. I don't usually have access to this server, and my own computer has only 24 GB of RAM.


Instead of computing inv(X'X)X'y directly, I am thinking of breaking the regression into 8 parts. Regression 1 uses the data for the first 10,000 groups, giving X1'X1 and X1'y1. Regression 2 uses the data for groups 10,001 to 20,000 and gives X2'X2 and X2'y2, and so on, where X_j consists of x_j plus the dummies for group_j.


Then my estimate would be inv(X1'X1 + ... + X8'X8)(X1'y1 + ... + X8'y8).
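A minimal sketch of this accumulation scheme on synthetic data (the sizes, coefficients, and random data below are illustrative, not the actual 29-million-row problem): each block contributes its X_j'X_j and X_j'y_j, so only the small k × k sums ever live in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 9                          # illustrative: x plus 8 dummies
beta_true = np.arange(1.0, k + 1.0)

# Running sums of the normal equations; only k x k and k values in RAM.
XtX = np.zeros((k, k))
Xty = np.zeros(k)

for _ in range(8):             # 8 blocks, mirroring the proposed split
    X = rng.normal(size=(1000, k))
    y = X @ beta_true + rng.normal(scale=0.1, size=1000)
    XtX += X.T @ X             # accumulate X_j' X_j
    Xty += X.T @ y             # accumulate X_j' y_j

# Solve the pooled normal equations without forming an explicit inverse.
beta_hat = np.linalg.solve(XtX, Xty)
```

Note that `np.linalg.solve()` factorizes XtX rather than inverting it, which is both faster and numerically safer than `inv(XtX) @ Xty`.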


The problem is reading the data efficiently to do this. The data are in a CSV file that is not organized by group. I am thinking of reading in the entire dataset and dumping it out to a new CSV file organized by group. Then I would read 10,000 × 364 rows at a time, repeating 8 times.
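The group demeaning can also be done without an intermediate sorted file, via a two-pass scan: first accumulate per-group sums and counts chunk by chunk, then demean each chunk against the precomputed group means. A sketch with pandas, using a tiny in-memory CSV and assumed column names (`group`, `y`, `x`):

```python
import io
import numpy as np
import pandas as pd

# Hypothetical tiny CSV standing in for the real file; columns are assumed.
csv_text = "group,y,x\n" + "\n".join(
    f"{g},{g + 0.5 * w},{w}" for g in range(3) for w in range(4)
)

# Pass 1: accumulate per-group sums and counts chunk by chunk.
sums, counts = None, None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    s = chunk.groupby("group")[["y", "x"]].sum()
    c = chunk.groupby("group").size()
    if sums is None:
        sums, counts = s, c
    else:
        sums = sums.add(s, fill_value=0.0)      # align on group labels
        counts = counts.add(c, fill_value=0)

means = sums.div(counts, axis=0)                # per-group means of y and x

# Pass 2: demean each chunk using the precomputed group means.
demeaned_parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    m = means.loc[chunk["group"]].to_numpy()    # broadcast means to rows
    demeaned_parts.append(chunk[["y", "x"]].to_numpy() - m)

demeaned = np.vstack(demeaned_parts)
```

On the real file you would pass the filename to `pd.read_csv()` with a chunksize of a few million rows, and feed each demeaned chunk straight into the X'X / X'y accumulation instead of collecting the parts.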

My questions are:

  1. Is there a more efficient way to do this regression?

  2. Is there a way to bypass creating a new CSV file? If I do have to create a new data file, what is the best format? (I have never used PyTables or h5py, but am willing to consider them.)

  3. Would scikit-learn be more efficient than sm.OLS if I tweak LASSO to do an OLS instead of a regularized regression?


Suggestions would be greatly appreciated. Thanks in advance.

Recommended Answer


Maybe not a definite answer, but some comments:

  1. Using an explicit matrix inverse is numerically not very stable. Standard solvers such as scipy.linalg.lstsq() use a proper matrix decomposition instead of inv(X'X)X'y.
  2. Since least squares is a linear estimator, there is no problem in splitting the data into blocks and computing the result step by step, which cuts down the required RAM. It is described here how to split a least-squares problem into two blocks, and this generalizes easily to more blocks. The recursive least squares filter is based on the same idea. For your data size, you should keep numerical stability in mind.
  3. PyTables seems like a good idea, since it can handle data that does not fit into memory. numpy.save() would be a simpler and faster alternative to CSV.
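A minimal illustration of point 1 on synthetic data, using numpy.linalg.lstsq(), which has a near-identical interface to scipy.linalg.lstsq() and likewise solves via a decomposition rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative small problem: 3 regressors, 100 observations.
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.01, size=100)

# Decomposition-based least squares: inv(X'X) is never formed.
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```

For point 3, a .npy file written with numpy.save() can later be opened with numpy.load(..., mmap_mode='r'), which memory-maps the array from disk instead of reading it all into RAM at once.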

