Python MemoryError when doing fitting with Scikit-learn

Question

I am running Python 2.7 (64-bit) on a Windows 8 64-bit system with 24 GB of memory. Fitting the usual sklearn.linear_model.Ridge runs fine.

Problem: However, when I use sklearn.linear_model.RidgeCV(alphas=alphas) instead, the line rr.fit(X_train, y_train), which performs the fit, raises the MemoryError shown below.

How can I prevent this error?

Code snippet

from sklearn.linear_model import RidgeCV


def fit(X_train, y_train):
    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]

    rr = RidgeCV(alphas=alphas)
    rr.fit(X_train, y_train)

    return rr


rr = fit(X_train, y_train)

Error

MemoryError                               Traceback (most recent call last)
<ipython-input-41-a433716e7179> in <module>()
      1 # Fit Training set
----> 2 rr = fit(X_train, y_train)

<ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
      3 
      4     rr = RidgeCV(alphas=alphas)
----> 5     rr.fit(X_train, y_train)
      6 
      7     return rr

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    696                                   gcv_mode=self.gcv_mode,
    697                                   store_cv_values=self.store_cv_values)
--> 698             estimator.fit(X, y, sample_weight=sample_weight)
    699             self.alpha_ = estimator.alpha_
    700             if self.store_cv_values:

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    608             raise ValueError('bad gcv_mode "%s"' % gcv_mode)
    609 
--> 610         v, Q, QT_y = _pre_compute(X, y)
    611         n_y = 1 if len(y.shape) == 1 else y.shape[1]
    612         cv_values = np.zeros((n_samples * n_y, len(self.alphas)))

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
    559     def toarray(self, order=None, out=None):
    560         """See the docstring for `spmatrix.toarray`."""
--> 561         return self.tocoo(copy=False).toarray(order=order, out=out)
    562 
    563     ##############################################################

C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
    236     def toarray(self, order=None, out=None):
    237         """See the docstring for `spmatrix.toarray`."""
--> 238         B = self._process_toarray_args(order, out)
    239         fortran = int(B.flags.f_contiguous)
    240         if not fortran and not B.flags.c_contiguous:

C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
    633             return out
    634         else:
--> 635             return np.zeros(self.shape, dtype=self.dtype, order=order)
    636 
    637 

MemoryError: 


Code

print type(X_train)
print X_train.shape

Result

<class 'scipy.sparse.csr.csr_matrix'>
(183576, 101507)

Recommended answer

Take a look at this part of your stack trace:

    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

The algorithm you're using relies on NumPy's linear algebra routines to compute the SVD. Those routines can't handle sparse matrices, so the code simply converts the input to a regular dense array. The first step of that conversion is to allocate an all-zero array, which is then filled in with the values stored in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you haven't chosen one explicitly) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:

183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
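
The same arithmetic can be checked in a few lines of Python. This is only a sketch: the helper name dense_nbytes is made up for illustration, and it assumes 8-byte float64 elements as discussed above.

```python
def dense_nbytes(shape, itemsize=8):
    """Bytes needed for a dense 2-D array of the given shape.

    itemsize=8 assumes float64 elements (hypothetical helper, not part of any library).
    """
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize


shape = (183576, 101507)  # X_train.shape from the question
nbytes = dense_nbytes(shape)

print(nbytes)                       # 149073992256
print(round(nbytes / 1e9, 1))       # size in GB (10**9 bytes)
print(round(nbytes / 2.0 ** 30, 1)) # size in GiB (2**30 bytes)
```

Running a check like this on X_train.shape before calling fit gives a quick idea of whether a dense copy could ever fit in RAM.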

Your system's memory manager probably took one look at that allocation request and gave up: it is roughly six times your 24 GB of RAM. But what can you do about it?

First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
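
One possible way to do that reduction, offered as a sketch rather than a recommendation for this particular dataset, is scikit-learn's TruncatedSVD, which accepts sparse input directly. The data below is a small synthetic stand-in for X_train and y_train, and n_components=20 is an arbitrary assumption, not a tuned value. The code is written for a current Python 3 / scikit-learn install rather than the Python 2.7 setup in the question.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)

# Small synthetic stand-in for the real (183576, 101507) sparse matrix.
X_train = sparse.random(200, 500, density=0.01, format="csr", random_state=rng)
y_train = rng.randn(200)

# Project the sparse features onto a few components; the result is a small
# dense array of shape (n_samples, n_components).
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X_train)

# RidgeCV now only ever sees the reduced dense matrix.
alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]
rr = RidgeCV(alphas=alphas)
rr.fit(X_reduced, y_train)

print(X_reduced.shape)
print(rr.alpha_)
```

With the real data, the reduced matrix would take n_samples * n_components * 8 bytes, so the number of components directly controls the memory footprint.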

Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Storing non-sparse data in a sparse matrix representation takes more space than a plain dense array would, so this approach might not work. Proceed with caution.
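
For reference, here is a minimal sketch of calling scipy.sparse.linalg.svds on a sparse matrix. It is not a drop-in replacement for the np.linalg.svd call in the stack trace: svds computes only the k largest singular triplets (k must be smaller than both matrix dimensions), and whether a truncated decomposition is acceptable for this algorithm is an open question. The matrix and the value of k are synthetic assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Small synthetic sparse matrix standing in for X_train.
X = sparse.random(300, 400, density=0.02, format="csr", random_state=0)

k = 10  # number of singular triplets to compute; must be < min(X.shape)
U, s, Vt = svds(X, k=k)

# The input stays sparse, but the factors come back as dense arrays.
print(U.shape, s.shape, Vt.shape)  # (300, 10) (10,) (10, 400)
print(type(U))
```

Note that U is dense with shape (n_samples, k), which illustrates the caveat above: the sparse routine avoids densifying X, but its outputs still grow with the number of samples and the number of components requested.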
