ValueError: negative dimensions are not allowed in scikit linear regression CV model with sparse matrices


Problem description

I recently competed in a Kaggle competition and ran into a problem when trying to run the linear CV models from scikit-learn. I am aware of a similar question on Stack Overflow, but I can't see how the accepted answer relates to my issue. Any assistance would be greatly appreciated. My code is given below:

import pandas as pd
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

train = pd.read_csv(".../train.csv")
test = pd.read_csv(".../test.csv")
data = pd.read_csv(".../sampleSubmission.csv")

# Build sparse TF-IDF features from the tweet text
transformer = TfidfVectorizer(max_features=None)
Y = transformer.fit_transform(train.tweet)
Z = transformer.transform(test.tweet)

clf = linear_model.RidgeCV()

# Fit one model per target column (positions 4 to 27 of train) and write
# each prediction into the matching submission column, starting at column 1
a = 4
b = 1
while a < 28:
    clf.fit(Y, train.iloc[:, a])
    pred = clf.predict(Z)
    data[data.columns[b]] = pd.DataFrame(pred)
    b = b + 1
    a = a + 1
print(b)

The full error that I receive is pasted below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-41c31233c15c> in <module>()
      1 blah=train.ix[:,a]
----> 2 clf.fit(Y, blah)

D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    815                                   gcv_mode=self.gcv_mode,
    816                                   store_cv_values=self.store_cv_values)
--> 817             estimator.fit(X, y, sample_weight=sample_weight)
    818             self.alpha_ = estimator.alpha_
    819             if self.store_cv_values:

D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
    722             raise ValueError('bad gcv_mode "%s"' % gcv_mode)
    723 
--> 724         v, Q, QT_y = _pre_compute(X, y)
    725         n_y = 1 if len(y.shape) == 1 else y.shape[1]
    726         cv_values = np.zeros((n_samples * n_y, len(self.alphas)))

D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
    607     def _pre_compute(self, X, y):
    608         # even if X is very sparse, K is usually very dense
--> 609         K = safe_sparse_dot(X, X.T, dense_output=True)
    610         v, Q = linalg.eigh(K)
    611         QT_y = np.dot(Q.T, y)

D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
     76     from scipy import sparse
     77     if sparse.issparse(a) or sparse.issparse(b):
---> 78         ret = a * b
     79         if dense_output and hasattr(ret, "toarray"):
     80             ret = ret.toarray()

D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
    301             if self.shape[1] != other.shape[0]:
    302                 raise ValueError('dimension mismatch')
--> 303             return self._mul_sparse_matrix(other)
    304 
    305         try:

D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
    518 
    519         nnz = indptr[-1]
--> 520         indices = np.empty(nnz, dtype=np.intc)
    521         data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
    522 

ValueError: negative dimensions are not allowed

Recommended answer

It looks like this problem occurs even without sklearn. It is in scipy.sparse matrix multiplication. There is an issue about it on the scipy-users board: "sparse matrix multiplication problem". The crux of the problem is that scipy uses a 32-bit int for non-zero indices during sparse matrix multiplication (the np.intc dtype on the marked line at the bottom of the traceback above). That count can overflow if the product has too many non-zero elements, and the overflow makes the variable nnz negative. The code at the last arrow then creates an empty array of size nnz, resulting in a ValueError due to a negative dimension.
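The overflow mechanism can be illustrated in isolation with NumPy alone. This is a minimal sketch, not scipy's actual code path: the per-row counts in row_counts are made-up values chosen so that their running total exceeds the 32-bit range, and the cast to int32 stands in for the 32-bit index arrays scipy used here.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647

# Hypothetical non-zero counts for three rows of a product matrix.
# Their total (3,000,000,000) does not fit in a signed 32-bit int.
row_counts = np.full(3, 1_000_000_000, dtype=np.int64)

# Store the running total in 32-bit ints, as an index pointer array would.
# The int64 -> int32 cast wraps around silently for out-of-range values.
indptr = np.cumsum(row_counts).astype(np.int32)

nnz = int(indptr[-1])  # wrapped, now negative

try:
    np.empty(nnz, dtype=np.intc)
    message = ""
except ValueError as exc:
    message = str(exc)

print(nnz, message)
```

The first two running totals still fit in 32 bits, so only the last entry wraps, which is why the failure surfaces only when the final count is used as an array size.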

You can reproduce the tail end of the traceback above without sklearn as follows:

import scipy.sparse as ss
X = ss.rand(75000, 42000, format='csr', density=0.01)
X * X.T

For this problem, the input is probably quite sparse, but it looks like RidgeCV multiplies X by X.T in the last part of the traceback within sklearn. That product might not be sparse enough.
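One rough way to gauge whether X * X.T could exceed the 32-bit limit, before attempting the product, is to compute an upper bound on its number of stored entries. The sketch below is illustrative only: gram_nnz_upper_bound and may_overflow_int32 are hypothetical helper names, not part of scipy or sklearn, and the bound can overestimate the true count.

```python
import numpy as np
import scipy.sparse as ss

INT32_MAX = np.iinfo(np.int32).max


def gram_nnz_upper_bound(X):
    """Upper bound on the stored entries of X * X.T for a sparse X."""
    X = ss.csc_matrix(X)
    # Number of non-zeros in each column of X
    col_counts = np.diff(X.indptr).astype(np.int64)
    # A column with c non-zeros touches at most c * c entries of X * X.T
    pair_bound = int(np.sum(col_counts * col_counts))
    n_rows = X.shape[0]
    # The product has n_rows * n_rows entries when fully dense
    return min(n_rows * n_rows, pair_bound)


def may_overflow_int32(X):
    """True if the bound does not rule out a 32-bit index overflow."""
    return gram_nnz_upper_bound(X) > INT32_MAX


# Small example that is cheap to multiply for real
small = ss.random(200, 100, density=0.05, format="csr", random_state=0)
bound = gram_nnz_upper_bound(small)
actual = (small @ small.T).nnz

# For the 75000-row matrix in the answer, a fully dense product
# would have more entries than a 32-bit int can count.
rows_squared_exceeds = 75000 * 75000 > INT32_MAX
```

A bound below the limit means the product is safe from this particular overflow. A bound above it does not prove the overflow will happen, only that it cannot be ruled out from the sparsity pattern alone.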
