StratifiedShuffleSplit: ValueError: The least populated class in y has only 1 member, which is too few.
Problem description
I'm using the StratifiedShuffleSplit cross-validator to predict house prices in the Boston dataset. The error occurs when I run this sample code:
from sklearn.model_selection import StratifiedShuffleSplit

def fit_model_S(labels, features, step, clf, parameters):
    cv = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
    print(cv)
    # Iterate over the train/test index arrays (the ValueError is raised on this line)
    for train_index, test_index in cv.split(features, labels):
        labels_train, labels_test = labels[train_index], labels[test_index]
        features_train, features_test = features[train_index], features[test_index]
I get the error below. The same code works with ShuffleSplit. Does this mean that StratifiedShuffleSplit cannot be used with numeric labels?
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-141-b290147edcbf> in <module>()
33 dt_steps = [('decision', clf)]
34
---> 35 fit_model_S(labels, features,dt_steps,clf,parameters4)
36
37
<ipython-input-141-b290147edcbf> in fit_model_S(labels, features, step, clf, parameters)
8 cv = StratifiedShuffleSplit(n_splits=2,test_size=0.10, random_state = 42)
9 print (cv)
---> 10 for train_index, test_index in cv.split(features,labels):
11
12 labels_train, labels_test = labels[train_index], labels[test_index]
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
1194 """
1195 X, y, groups = indexable(X, y, groups)
-> 1196 for train, test in self._iter_indices(X, y, groups):
1197 yield train, test
1198
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in _iter_indices(self, X, y, groups)
1535 class_counts = np.bincount(y_indices)
1536 if np.min(class_counts) < 2:
-> 1537 raise ValueError("The least populated class in y has only 1"
1538 " member, which is too few. The minimum"
1539 " number of groups for any class cannot"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
A sample of the dataset is shown below.
RM LSTAT PTRATIO MEDV
0 6.575 4.98 15.3 504000.0
1 6.421 9.14 17.8 453600.0
2 7.185 4.03 17.8 728700.0
3 6.998 2.94 18.7 701400.0
4 7.147 5.33 18.7 760200.0
MEDV is the label in this case.
Recommended answer
The Boston Housing data is a dataset for a regression problem. You are using StratifiedShuffleSplit to divide it into training and test sets. StratifiedShuffleSplit, as described in the docs, is:
This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
Note the last sentence: "preserving the percentage of samples for each class". StratifiedShuffleSplit therefore treats each distinct y value as a separate class.
That cannot work here because your y is a regression target (continuous numerical data). At least one MEDV value occurs only once, so its "class" has a single member, which is exactly what the error message reports.
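To make the failure concrete, here is a minimal sketch that reproduces it on synthetic data. The arrays `X` and `y_continuous` are made-up stand-ins for the Boston features and MEDV, not the real dataset; the point is only that a continuous target yields "classes" with a single member.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.rand(50, 3)                  # hypothetical features
y_continuous = rng.rand(50) * 1000   # hypothetical continuous target

# Each distinct target value is counted as its own class
_, counts = np.unique(y_continuous, return_counts=True)
min_count = counts.min()

cv = StratifiedShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
try:
    next(cv.split(X, y_continuous))
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

With random floating-point targets, every value is effectively unique, so the smallest class has one member and the split is rejected.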
Use ShuffleSplit or train_test_split to divide your data instead. See here for more details on cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
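As a rough illustration of both suggestions, the sketch below splits a continuous target without stratification. The `features` and `labels` arrays are synthetic placeholders (assumed to be NumPy arrays, as in the question's indexing code), and the split parameters simply mirror the ones used in the question.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

rng = np.random.RandomState(0)
features = rng.rand(100, 3)      # hypothetical feature matrix
labels = rng.rand(100) * 1000    # hypothetical continuous target

# Option 1: ShuffleSplit, a drop-in replacement for the loop in the question
cv = ShuffleSplit(n_splits=2, test_size=0.10, random_state=42)
split_sizes = []
for train_index, test_index in cv.split(features, labels):
    labels_train, labels_test = labels[train_index], labels[test_index]
    features_train, features_test = features[train_index], features[test_index]
    split_sizes.append((len(train_index), len(test_index)))

# Option 2: a single random train/test split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.10, random_state=42
)
```

Neither call inspects the label values when choosing indices, so a continuous y is not a problem.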