Python MemoryError when doing fitting with Scikit-learn
Question
I am running Python 2.7 (64-bit) on a 64-bit Windows 8 system with 24 GB of memory. Fitting with the usual sklearn.linear_model.Ridge runs fine.
Problem: when I fit with sklearn.linear_model.RidgeCV(alphas=alphas) instead, I get the MemoryError shown below on the line rr.fit(X_train, y_train), which executes the fitting procedure.
How can I prevent this error?
Code snippet
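from sklearn.linear_model import RidgeCV

def fit(X_train, y_train):
    # Candidate regularization strengths that RidgeCV chooses between
    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]
    rr = RidgeCV(alphas=alphas)
    rr.fit(X_train, y_train)
    return rr

rr = fit(X_train, y_train)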
Error
MemoryError Traceback (most recent call last)
<ipython-input-41-a433716e7179> in <module>()
1 # Fit Training set
----> 2 rr = fit(X_train, y_train)
<ipython-input-35-9650bd58e76c> in fit(X_train, y_train)
3
4 rr = RidgeCV(alphas=alphas)
----> 5 rr.fit(X_train, y_train)
6
7 return rr
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
696 gcv_mode=self.gcv_mode,
697 store_cv_values=self.store_cv_values)
--> 698 estimator.fit(X, y, sample_weight=sample_weight)
699 self.alpha_ = estimator.alpha_
700 if self.store_cv_values:
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
608 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
609
--> 610 v, Q, QT_y = _pre_compute(X, y)
611 n_y = 1 if len(y.shape) == 1 else y.shape[1]
612 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
C:\Python27\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute_svd(self, X, y)
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
C:\Python27\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
559 def toarray(self, order=None, out=None):
560 """See the docstring for `spmatrix.toarray`."""
--> 561 return self.tocoo(copy=False).toarray(order=order, out=out)
562
563 ##############################################################
C:\Python27\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
236 def toarray(self, order=None, out=None):
237 """See the docstring for `spmatrix.toarray`."""
--> 238 B = self._process_toarray_args(order, out)
239 fortran = int(B.flags.f_contiguous)
240 if not fortran and not B.flags.c_contiguous:
C:\Python27\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
633 return out
634 else:
--> 635 return np.zeros(self.shape, dtype=self.dtype, order=order)
636
637
MemoryError:
Code
print type(X_train)
print X_train.shape
Result
<class 'scipy.sparse.csr.csr_matrix'>
(183576, 101507)
Recommended answer
Take a look at this part of your stack trace:
531 def _pre_compute_svd(self, X, y):
532 if sparse.issparse(X) and hasattr(X, 'toarray'):
--> 533 X = X.toarray()
534 U, s, _ = np.linalg.svd(X, full_matrices=0)
535 v = s ** 2
The algorithm you're using relies on numpy's linear algebra routines to do the SVD. Those routines can't handle sparse matrices, so the author simply converts the input to a regular dense array. The first step of that conversion is to allocate an all-zero array of the full shape, which is then filled in at the appropriate spots with the values stored in the sparse matrix. Sounds easy enough, but let's do the math. A float64 element (the default dtype, which you're probably using if you don't know what you're using) takes 8 bytes. So, based on the array shape you've provided, the new zero-filled array will need this many bytes:
183576 * 101507 * 8 = 149,073,992,256 ~= 150 gigabytes
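The same arithmetic can be checked in a few lines of Python. This is only a sketch: the helper name dense_nbytes is made up for illustration, and the 24 GB figure is taken from the question and treated here as 24 * 1024**3 bytes.

```python
def dense_nbytes(shape, itemsize=8):
    """Bytes needed for a dense 2-D array (itemsize=8 corresponds to float64)."""
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize

shape = (183576, 101507)      # X_train.shape reported in the question
needed = dense_nbytes(shape)  # size of the array that toarray() tries to allocate
available = 24 * 1024 ** 3    # 24 GB of RAM, as stated in the question

print(needed)                  # 149073992256
print(round(needed / 1e9, 1))  # 149.1 (decimal gigabytes)
print(needed > available)      # True
```

Running a check like this on X_train.shape before calling fit shows at a glance whether a dense copy could ever fit in memory.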
Your system's memory manager probably took one look at that allocation request and gave up. But what can you do about it?
First off, that looks like a fairly ridiculous number of features. I don't know anything about your problem domain or what your features are, but my gut reaction is that you need to do some dimensionality reduction here.
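One possible way to do that reduction, sketched here as an assumption rather than as what the answer prescribes, is scikit-learn's TruncatedSVD, which accepts sparse input directly. The data below is a small synthetic stand-in for the real matrix, and n_components=50 is an arbitrary illustrative value. The code is written for Python 3 and a recent scikit-learn.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import RidgeCV

# Small synthetic stand-in for the question's 183576 x 101507 sparse matrix.
X_train = sparse.random(500, 2000, density=0.01, format="csr", random_state=0)
y_train = np.random.RandomState(0).rand(500)

# TruncatedSVD works on the sparse matrix itself, so the full-width
# dense array is never allocated.
svd = TruncatedSVD(n_components=50, random_state=0)  # 50 is an arbitrary choice
X_reduced = svd.fit_transform(X_train)  # dense array of shape (500, 50)

# RidgeCV now only sees the small dense matrix.
alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1]
rr = RidgeCV(alphas=alphas).fit(X_reduced, y_train)
print(X_reduced.shape, rr.alpha_)
```

With the real data, a reduced float64 matrix of 183576 rows and 100 columns would take roughly 147 MB, which fits comfortably in 24 GB. Whether the reduced features still predict well depends on the problem and has to be checked.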
Second, you can try to fix the algorithm's mishandling of sparse matrices. It's choking on numpy.linalg.svd here, so you might be able to use scipy.sparse.linalg.svds instead. I don't know the algorithm in question, but it might not be amenable to sparse matrices. Even if you use the appropriate sparse linear algebra routines, it might produce (or internally use) some non-sparse matrices with sizes similar to your data. Using a sparse matrix representation to represent non-sparse data will only result in using more space than you would have originally, so this approach might not work. Proceed with caution.
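To get a feel for what scipy.sparse.linalg.svds provides, here is a minimal sketch on a small synthetic matrix (the shape, density, and k=10 are arbitrary assumptions). Note that svds computes only k singular triplets, not the full thin SVD requested in the traceback, so it is not a drop-in replacement inside RidgeCV.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Toy sparse matrix; the real X_train would be far larger.
X = sparse.random(300, 1000, density=0.02, format="csr", random_state=0)

k = 10  # number of singular triplets to compute; must be < min(X.shape)
U, s, Vt = svds(X, k=k)  # X stays sparse throughout

# Sort the triplets by descending singular value instead of relying on
# the order in which svds returns them.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Cross-check against the dense SVD, feasible only because this toy matrix is small.
s_dense = np.linalg.svd(X.toarray(), compute_uv=False)[:k]
print(U.shape, s.shape, Vt.shape)
print(np.allclose(s, s_dense, rtol=1e-4))
```

The factors U and Vt are dense, with shapes (n_samples, k) and (k, n_features), which illustrates the caveat above: for large k they approach the size of the data itself.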