Python MemoryError when doing fitting with Scikit-learn

Problem description

I am running Python 2.7 (64-bit) on a Windows 8 64-bit system with 24GB of memory. When fitting the usual sklearn.linear_model.Ridge, the code runs fine.

Problem: However, when using sklearn.linear_model.RidgeCV(alphas=alphas) for the fitting, I run into the MemoryError shown below on the line rr.fit(X_train, y_train), which executes the fitting procedure.

How can I prevent this error?

Code snippet

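from sklearn.linear_model import RidgeCV


def fit(X_train, y_train):
    # Candidate regularization strengths to cross-validate over
    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]

    rr = RidgeCV(alphas=alphas)
    rr.fit(X_train, y_train)  # the MemoryError is raised here

    return rr


rr = fit(X_train, y_train)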

Error

MemoryError                               Traceback (most recent call last)
<ipython-input-41-a433716e7179> in <module>()
      1 # Fit Training set
----> 2 rr = fit(X_train, y_train)

<ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
      3 
      4     rr = RidgeCV(alphas=alphas)
----> 5     rr.fit(X_train, y_train)
      6 
      7     return rr

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    696                                   gcv_mode=self.gcv_mode,
    697                                   store_cv_values=self.store_cv_values)
--> 698             estimator.fit(X, y, sample_weight=sample_weight)
    699             self.alpha_ = estimator.alpha_
    700             if self.store_cv_values:

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    608             raise ValueError('bad gcv_mode "%s"' % gcv_mode)
    609 
--> 610         v, Q, QT_y = _pre_compute(X, y)
    611         n_y = 1 if len(y.shape) == 1 else y.shape[1]
    612         cv_values = np.zeros((n_samples * n_y, len(self.alphas)))

C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
    559     def toarray(self, order=None, out=None):
    560         """See the docstring for `spmatrix.toarray`."""
--> 561         return self.tocoo(copy=False).toarray(order=order, out=out)
    562 
    563     ##############################################################

C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
    236     def toarray(self, order=None, out=None):
    237         """See the docstring for `spmatrix.toarray`."""
--> 238         B = self._process_toarray_args(order, out)
    239         fortran = int(B.flags.f_contiguous)
    240         if not fortran and not B.flags.c_contiguous:

C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
    633             return out
    634         else:
--> 635             return np.zeros(self.shape, dtype=self.dtype, order=order)
    636 
    637 

MemoryError: 

Code

print(type(X_train))
print(X_train.shape)

Result

<class 'scipy.sparse.csr.csr_matrix'>
(183576, 101507)

Recommended answer

Take a look at this part of the stack trace:

    531     def _pre_compute_svd(self, X, y):
    532         if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533             X = X.toarray()
    534         U, s, _ = np.linalg.svd(X, full_matrices=0)
    535         v = s ** 2

The algorithm you're using relies on numpy's linear algebra routines to do SVD. Those can't handle sparse matrices, so the author simply converts them to regular dense arrays. The first step in that conversion is to allocate an all-zero array of the full shape and then fill in the appropriate spots with the values stored in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you don't know what you're using) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will be:

183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
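
The same arithmetic can be reproduced in a few lines of plain Python. This is only a sketch: the shape comes from the question, while the helper name dense_size_bytes is made up for illustration and is not part of numpy, scipy, or scikit-learn.

```python
# Shape reported in the question for X_train.
n_samples, n_features = 183576, 101507

# A float64 element occupies 8 bytes.
BYTES_PER_FLOAT64 = 8


def dense_size_bytes(n_rows, n_cols, itemsize=BYTES_PER_FLOAT64):
    """Bytes needed for a dense n_rows x n_cols array (hypothetical helper)."""
    return n_rows * n_cols * itemsize


dense_bytes = dense_size_bytes(n_samples, n_features)
print(dense_bytes)  # 149073992256

# Compare against the 24GB of RAM mentioned in the question.
ram_bytes = 24 * 10**9
print(dense_bytes > ram_bytes)  # True
```

The request is roughly six times the machine's physical memory, so the np.zeros call at the bottom of the traceback cannot succeed no matter how sparse the original data is.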

Your system's memory manager probably took one look at that allocation request and refused it outright, hence the MemoryError. But what can you do about it?

First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
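
One possible way to do that reduction, sketched here on small synthetic data, is scikit-learn's TruncatedSVD, which accepts sparse input directly. The answer does not name a specific technique, so this choice, the matrix sizes, and n_components=50 are all assumptions for illustration; a suitable number of components for the real data would have to be found by experiment.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import RidgeCV

# Small synthetic stand-in for the question's X_train / y_train (assumption).
rng = np.random.RandomState(0)
X_train = sparse.random(500, 2000, density=0.01, format="csr", random_state=rng)
y_train = rng.randn(500)

# Project the sparse features onto 50 components without densifying X_train.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_train)  # dense array of shape (500, 50)

# The reduced matrix is small enough for RidgeCV to handle as a dense array.
rr = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1])
rr.fit(X_reduced, y_train)

print(X_reduced.shape)
print(rr.alpha_)
```

At the question's scale the reduced matrix would have 183576 rows and only as many columns as components requested, so its dense size is a small fraction of the full 150 gigabytes.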

Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on the dense conversion that feeds numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some dense matrices with sizes similar to your data. Storing dense data in a sparse matrix representation takes more space than a plain array would, so this approach might not work. Proceed with caution.
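
As a rough illustration of what scipy.sparse.linalg.svds does, the sketch below runs it on a small synthetic sparse matrix and checks the result against numpy.linalg.svd. Note that svds computes only the k largest singular triplets rather than the full decomposition that _pre_compute_svd requests, so it is not a drop-in replacement; whether a truncated SVD is enough for RidgeCV's internals is not something this sketch establishes. The matrix size and k are arbitrary assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Small synthetic sparse matrix standing in for the real data (assumption).
rng = np.random.RandomState(0)
X = sparse.random(300, 200, density=0.05, format="csr", random_state=rng)

# Compute only the k largest singular triplets; X is never converted to dense.
k = 10
U, s, Vt = svds(X, k=k)

# Sort the triplets so the singular values are in descending order.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Reference values from the dense routine; feasible only for a toy matrix.
s_dense = np.linalg.svd(X.toarray(), compute_uv=False)[:k]

print(U.shape, s.shape, Vt.shape)
print(np.allclose(s, s_dense, rtol=1e-6))
```

The factors returned by svds are themselves dense, with shapes (n_samples, k) and (k, n_features), which is the kind of dense intermediate the answer warns about: they stay manageable only while k is small.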
