Scikit-Learn Logistic Regression Memory Error

Problem Description

I'm attempting to use sklearn 0.11's LogisticRegression object to fit a model on 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.

When I attempt to fit the classifier pythonw.exe gives me:

Application Error "The instruction at ... referenced memory at 0x00000000". The memory could not be written".

The features are extremely sparse, about 10 per observation, and are binary (either 1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't appear to be the case. The model only fits when I use fewer observations and/or fewer features.

If anything, I would like to use even more observations and features. My naive understanding is that the liblinear library running things behind the scenes is capable of supporting that. Any ideas for how I might squeeze a few more observations in?

My code is as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

y_vectorizer = LabelVectorizer(y)  # my custom vectorizer for labels
y = y_vectorizer.fit_transform(y)

x_vectorizer = CountVectorizer(binary=True, analyzer=features)
x = x_vectorizer.fit_transform(x)

clf = LogisticRegression()
clf.fit(x, y)

The features() function I pass to analyzer just returns a list of strings indicating the features detected in each observation.

I'm using Python 2.7 and sklearn 0.11 on Windows XP with 4 GB of RAM.

Answer

liblinear (the backing implementation of sklearn.linear_model.LogisticRegression) will host its own copy of the data because it is a C++ library whose internal memory layout cannot be directly mapped onto a pre-allocated sparse matrix in scipy such as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix.

In your case I would recommend loading your data as a scipy.sparse.csr_matrix and feeding it to a sklearn.linear_model.SGDClassifier (with loss='log' if you want a logistic regression model and the ability to call the predict_proba method). SGDClassifier will not copy the input data if it already uses the scipy.sparse.csr_matrix memory layout.
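A minimal sketch of that swap (not from the original answer; it assumes x is already the scipy.sparse.csr_matrix produced above and y is a 1-D array of class labels rather than the output of the custom label vectorizer):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log')       # loss='log' gives a logistic regression model
clf.fit(x, y)                         # no data copy when x is already in CSR layout
probabilities = clf.predict_proba(x)  # available because loss='log'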

Expect it to allocate a dense model of 800 * (80000 + 1) * 8 / (1024 ** 2) ≈ 488 MB in memory (in addition to the size of your input dataset).
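For reference, that figure is just the dense coefficient matrix, reproduced here as a small sketch:

n_classes, n_features = 800, 80000
# one float64 weight per (class, feature) pair plus one intercept per class
model_mb = n_classes * (n_features + 1) * 8 / (1024.0 ** 2)
print(model_mb)  # ~488.3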

How to optimize memory access for your dataset

要在数据集提取后释放内存,您可以:

To free memory after dataset extraction you can:

from sklearn.externals import joblib  # joblib as bundled with sklearn 0.11

x_vectorizer = CountVectorizer(binary=True, analyzer=features)
x = x_vectorizer.fit_transform(x)
joblib.dump(x.tocsr(), 'dataset.joblib')  # persist in CSR memory layout

Then quit this Python process (to force complete memory deallocation) and, in a new process:

x_csr = joblib.load('dataset.joblib')

Under Linux / OS X you could memory-map that even more efficiently with:

x_csr = joblib.load('dataset.joblib', mmap_mode='c')
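Putting the pieces together, a hedged end-to-end sketch of the new process (the labels.joblib dump of y is hypothetical, added only to make the example self-contained):

from sklearn.externals import joblib  # joblib as bundled with sklearn 0.11
from sklearn.linear_model import SGDClassifier

x_csr = joblib.load('dataset.joblib', mmap_mode='c')  # copy-on-write memory map
y = joblib.load('labels.joblib')  # hypothetical: labels dumped the same way as x

clf = SGDClassifier(loss='log')
clf.fit(x_csr, y)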
