Regression with big data


Problem Description


I have data on two variables (y, x): 7 years of weekly data (364 weeks) for 80,000 groups. I need to demean the data by group and run a regression of y on x plus 8 dummy variables that need to be created. That is 364 × 80,000 ≈ 29 million rows, or roughly 290 million data points across the ~10 columns. I 'borrowed' an account on a server and found that the regression needs at least 144 GB of memory. I don't usually have access to this server, and my own computer has only 24 GB of RAM.


Instead of computing inv(X'X)X'y directly, I am thinking of breaking the regression into 8 parts. Regression 1 uses the data for the first 10,000 groups, giving X1'X1 and X1'y1. Regression 2 uses the data for groups 10,001 to 20,000 and gives X2'X2 and X2'y2, and so on, where X_j consists of x_j plus the dummies for group_j.


Then my estimate would be inv(X1'X1 + ... + X8'X8)(X1'y1 + ... + X8'y8).
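A minimal sketch of this accumulation scheme on synthetic data (the sizes, coefficients, and random data below are illustrative, not the actual 29-million-row problem): each block contributes its X_j'X_j and X_j'y_j, so only the small k × k sums ever live in memory.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 9                          # illustrative: x plus 8 dummies
beta_true = np.arange(1.0, k + 1.0)

# Running sums of the normal equations; only k x k and k values in RAM.
XtX = np.zeros((k, k))
Xty = np.zeros(k)

for _ in range(8):             # 8 blocks, mirroring the proposed split
    X = rng.normal(size=(1000, k))
    y = X @ beta_true + rng.normal(scale=0.1, size=1000)
    XtX += X.T @ X             # accumulate X_j' X_j
    Xty += X.T @ y             # accumulate X_j' y_j

# Solve the pooled normal equations without forming an explicit inverse.
beta_hat = np.linalg.solve(XtX, Xty)
```

Note that `np.linalg.solve()` factorizes XtX rather than inverting it, which is both faster and numerically safer than `inv(XtX) @ Xty`.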


The problem is reading the data efficiently to do this. The data are in a CSV file that is not organized by group. I am thinking of reading in the entire dataset and dumping it out to a new CSV file organized by group. Then I would read 10,000 × 364 rows at a time, repeating 8 times.
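The group demeaning can also be done without an intermediate sorted file, via a two-pass scan: first accumulate per-group sums and counts chunk by chunk, then demean each chunk against the precomputed group means. A sketch with pandas, using a tiny in-memory CSV and assumed column names (`group`, `y`, `x`):

```python
import io
import numpy as np
import pandas as pd

# Hypothetical tiny CSV standing in for the real file; columns are assumed.
csv_text = "group,y,x\n" + "\n".join(
    f"{g},{g + 0.5 * w},{w}" for g in range(3) for w in range(4)
)

# Pass 1: accumulate per-group sums and counts chunk by chunk.
sums, counts = None, None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    s = chunk.groupby("group")[["y", "x"]].sum()
    c = chunk.groupby("group").size()
    if sums is None:
        sums, counts = s, c
    else:
        sums = sums.add(s, fill_value=0.0)      # align on group labels
        counts = counts.add(c, fill_value=0)

means = sums.div(counts, axis=0)                # per-group means of y and x

# Pass 2: demean each chunk using the precomputed group means.
demeaned_parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    m = means.loc[chunk["group"]].to_numpy()    # broadcast means to rows
    demeaned_parts.append(chunk[["y", "x"]].to_numpy() - m)

demeaned = np.vstack(demeaned_parts)
```

On the real file you would pass the filename to `pd.read_csv()` with a chunksize of a few million rows, and feed each demeaned chunk straight into the X'X / X'y accumulation instead of collecting the parts.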

My questions are:

  1. Is there a more efficient way to do this regression?

  2. Is there a way to bypass creating a new CSV file? If I do have to create a new data file, what is the best format? (I have never used PyTables or h5py, but am willing to consider them.)

  3. Would scikit-learn be more efficient than sm.OLS if I tweak LASSO to do an OLS instead of a regularized regression?


Suggestions would be greatly appreciated. Thanks in advance.

Recommended Answer


Maybe not a definite answer, but some comments:

  1. Using an explicit matrix inverse is numerically not very stable. Standard solvers such as scipy.linalg.lstsq() use a proper matrix decomposition instead of inv(X'X)X'y.
  2. Since least squares is a linear estimator, there is no problem in splitting the data into blocks and computing the result step by step, which cuts down the required RAM. It is described here how to split a least-squares problem into two blocks, and this generalizes easily to more blocks. The recursive least squares filter is based on the same idea. For your data size, you should keep numerical stability in mind.
  3. PyTables seems like a good idea, since it can handle data that does not fit into memory. numpy.save() would be a simpler and faster alternative to CSV.
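A minimal illustration of point 1 on synthetic data, using numpy.linalg.lstsq(), which has a near-identical interface to scipy.linalg.lstsq() and likewise solves via a decomposition rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative small problem: 3 regressors, 100 observations.
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.01, size=100)

# Decomposition-based least squares: inv(X'X) is never formed.
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```

For point 3, a .npy file written with numpy.save() can later be opened with numpy.load(..., mmap_mode='r'), which memory-maps the array from disk instead of reading it all into RAM at once.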

