Found array with 0 sample(s) (shape=(0, 40)) while a minimum of 1 is required
Problem description
I'm testing a simple prediction program with Python 2.7, sklearn 0.17.1 and numpy 1.11.0. I got a matrix of probabilities from an LDA model, and now I want to create a RandomForestClassifier to predict results from those probabilities. My code is:
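import numpy as np
from sklearn.cross_validation import cross_val_score  # module path used by sklearn 0.17
from sklearn.ensemble import RandomForestClassifier

maxlen = 40  # width of each probability row
props = []
for doc in corpus:
    topics = model.get_document_topics(doc)
    tprops = [0] * maxlen
    for topic in topics:
        # index by the current pair (topic), not by the whole list (topics)
        tprops[topic[0]] = topic[1]
    props.append(tprops)

ntheta = np.array(props)
ny = np.array(y)
clf = RandomForestClassifier(n_estimators=100)
accuracy = cross_val_score(clf, ntheta, ny, scoring='accuracy')
print accuracy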
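The row-building step in the loop above can be checked in isolation. This is only a minimal sketch: it assumes that get_document_topics yields (topic_id, probability) pairs, and dense_topic_row is a hypothetical helper name that does not appear in the original code.

```python
def dense_topic_row(topics, maxlen):
    """Turn sparse (topic_id, probability) pairs into a fixed-width row."""
    row = [0.0] * maxlen
    for topic_id, prob in topics:
        # Each pair fills exactly one column; the rest stay at 0.0.
        row[topic_id] = prob
    return row


# Illustrative input, not real model output.
row = dense_topic_row([(1, 0.25), (3, 0.75)], 5)
```

Checking the width of one such row before stacking all of them makes a mismatch between maxlen and the model's topic count easier to spot.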
ValueError Traceback (most recent call last)
<ipython-input-65-a7d276df43e9> in <module>()
1 # clf.fit(nteta, ny)
2 print nteta.shape, ny.shape
----> 3 accuracy = cross_val_score(clf, nteta, ny, scoring = 'accuracy')
4 print accuracy
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
1431 train, test, verbose, None,
1432 fit_params)
-> 1433 for train, test in cv)
1434 return np.array(scores)[:, 0]
1435
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
798 # was dispatched. In particular this covers the edge
799 # case of Parallel used with an exhausted iterator.
--> 800 while self.dispatch_one_batch(iterator):
801 self._iterating = True
802 else:
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
656 return False
657 else:
--> 658 self._dispatch(tasks)
659 return True
660
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in _dispatch(self, batch)
564
565 if self._pool is None:
--> 566 job = ImmediateComputeBatch(batch)
567 self._jobs.append(job)
568 self.n_dispatched_batches += 1
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, batch)
178 # Don't delay the application, to avoid keeping the input
179 # arguments in memory
--> 180 self.results = batch()
181
182 def get(self):
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1529 estimator.fit(X_train, **fit_params)
1530 else:
-> 1531 estimator.fit(X_train, y_train, **fit_params)
1532
1533 except Exception as e:
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/home/egor/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
405 " minimum of %d is required%s."
406 % (n_samples, shape_repr, ensure_min_samples,
--> 407 context))
408
409 if ensure_min_features > 0 and array.ndim == 2:
ValueError: Found array with 0 sample(s) (shape=(0, 40)) while a minimum of 1 is required.
UPD: Why did I get two downvotes? Please keep the criticism constructive.
UPD: cotique found that y was filled in incorrectly (it should have held different classes). With y filled in correctly, the problem does not occur. In my case the classes were wrong and there were 39774 of them. Still, that does not really explain why the error occurs when there are 39774 classes to predict.
Recommended answer
This is the original code from the scikit-learn repo (validation.py#L409):
if ensure_min_samples > 0:
    n_samples = _num_samples(array)
    if n_samples < ensure_min_samples:
        raise ValueError("Found array with %d sample(s) (shape=%s) while a"
                         " minimum of %d is required%s."
                         % (n_samples, shape_repr, ensure_min_samples,
                            context))
So n_samples = _num_samples(array). By the way, array is the input object to check / convert.
Next, validation.py#L111:
def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # stuff
    if not hasattr(x, '__len__') and not hasattr(x, 'shape'):
        # stuff
    if hasattr(x, 'shape'):
        if len(x.shape) == 0:
            # raise TypeError
        return x.shape[0]
    else:
        return len(x)
So the number of samples equals the length of the first dimension of array, which is 0 since array.shape = (0, 40).
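The two excerpts can be combined into a small runnable sketch that reproduces the message from the traceback. It is a simplified stand-in, not scikit-learn's code: ArrayLike, num_samples and check_min_samples are hypothetical names, the elided branches are left out, and the TypeError text is made up for illustration. A NumPy array exposes the same shape attribute that ArrayLike imitates here.

```python
class ArrayLike(object):
    """Hypothetical stand-in that exposes only a NumPy-style `shape`."""

    def __init__(self, shape):
        self.shape = shape


def num_samples(x):
    """Simplified count: first dimension if x has a shape, else len(x)."""
    if hasattr(x, "shape"):
        if len(x.shape) == 0:
            # Illustrative message, not the library's wording.
            raise TypeError("zero-dimensional input has no samples")
        return x.shape[0]
    return len(x)


def check_min_samples(x, ensure_min_samples=1):
    """Raise the same kind of ValueError as the excerpt above."""
    n_samples = num_samples(x)
    if n_samples < ensure_min_samples:
        raise ValueError(
            "Found array with %d sample(s) (shape=%s) while a"
            " minimum of %d is required."
            % (n_samples, getattr(x, "shape", None), ensure_min_samples))
    return n_samples


# An input with 0 rows and 40 columns triggers the check.
try:
    check_min_samples(ArrayLike((0, 40)))
    message = None
except ValueError as exc:
    message = str(exc)
```

The check only looks at the first dimension, so the 40 columns are irrelevant: what reached the classifier's fit was a training set with zero rows.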
I don't know what this all means, but I hope it makes things clearer.
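On the open question of why 39774 classes lead to an empty array, one plausible explanation (an assumption, not checked against the 0.17.1 source) is that cross_val_score stratifies the folds by class for a classifier, and that each class hands its members out to folds starting again from the first fold. If every sample has its own class, every sample then lands in the test part of the first fold and nothing is left to train on. The sketch below is a simplified model of that behaviour, not scikit-learn's implementation; stratified_test_folds and train_indices are hypothetical helpers.

```python
from collections import Counter


def stratified_test_folds(y, n_folds=3):
    """Assign each sample a test fold, cycling 0, 1, 2, ... within each class."""
    seen = Counter()
    folds = []
    for label in y:
        # The k-th member of a class goes to fold k % n_folds.
        folds.append(seen[label] % n_folds)
        seen[label] += 1
    return folds


def train_indices(folds, test_fold):
    """Indices of samples that are NOT in the given test fold."""
    return [i for i, fold in enumerate(folds) if fold != test_fold]


# Every label is unique, as when y wrongly holds one class per sample.
y_unique = list(range(10))
empty_train = train_indices(stratified_test_folds(y_unique), 0)

# Repeated classes leave samples available for training.
y_ok = [0, 1] * 5
ok_train = train_indices(stratified_test_folds(y_ok), 0)
```

Under this model the first training split for y_unique is empty, which would match an X_train of shape (0, 40). Counting the members of each class in y before calling cross_val_score is a cheap way to catch this situation.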