Scikit-learn: "The least populated class in y has only 1 member"

Problem description

I am trying to do a Random Forest Regression using Scikit-learn. The first step after loading the data using Pandas is to split the data into a test set and a training set. However, I get the error:

The least populated class in y has only 1 member

I've searched Google and found various instances of this error, but I still can't seem to get an understanding of what this error means.

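import joblib
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

training_file = "training_data.txt"
data = pd.read_csv(training_file, sep='\t')

# Separate the target column from the features
y = data.Result
X = data.drop('Result', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

# Scale the features, then fit a random forest regressor
pipeline = make_pipeline(preprocessing.StandardScaler(), RandomForestRegressor(n_estimators=100))

# Grid of hyperparameters to search over with 10-fold cross-validation
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
                'randomforestregressor__max_depth' : [None, 5, 3, 1] }

model = GridSearchCV(pipeline, hyperparameters, cv=10)

model.fit(X_train, y_train)

prediction = model.predict(X_test)

# Persist the fitted search object to disk
joblib.dump(model, 'ms5000.pkl')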

The train_test_split method yields this stack trace:

Traceback (most recent call last):
  File "/Users/justin.shapiro/Desktop/IPML_Model/model_definition.py", line 18, in <module>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.22, random_state=123, stratify=y)
  File "/Library/Python/2.7/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/Library/Python/2.7/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/Library/Python/2.7/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

This is a sample of my dataset:

var1    var2    var3    var4    var5    var6    var7    var8    Result
high    5000.0  0       60      1000    75      0.23    0.75    17912.0
mid     5000.0  0       60      1000    50      0.23    0.75    18707.0
low     5000.0  0       60      1000    25      0.23    0.75    17912.0
high    5000.0  5       60      1000    75      0.23    0.75    18577.0
mid     5000.0  5       60      1000    50      0.23    0.75    19407.0
low     5000.0  5       60      1000    25      0.23    0.75    18577.0

What is this error and how can I get rid of it?

Recommended answer

The error occurs on this line:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.22, random_state=123, stratify=y)

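Try removing stratify=y. Stratification treats every distinct value in y as a class and requires at least 2 members per class. Result is a continuous regression target, so at least one of its values occurs only once, and the split fails.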
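
As a rough illustration of why stratifying on a continuous target fails and what can be done instead, here is a minimal sketch. It does not use the asker's file: the data is synthetic, and the column choice, the sample size of 100, and the 5 quantile bins are assumptions made only for the example. Option 1 is the fix suggested in this answer. Option 2 is an alternative that is not part of the original answer, for cases where a split balanced across the target range is still wanted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (hypothetical values): a
# continuous target in which practically every value is unique.
rng = np.random.default_rng(123)
X = pd.DataFrame({
    "var2": rng.uniform(1000, 5000, size=100),
    "var6": rng.choice([25, 50, 75], size=100),
})
y = pd.Series(rng.uniform(17000, 20000, size=100), name="Result")

# stratify=y would treat each distinct target value as its own class.
# The size of the smallest such "class" shows why the split is rejected.
print("smallest class size:", y.value_counts().min())

# Option 1: plain random split without stratification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Option 2 (an alternative, not from the original answer): stratify on
# quantile bins of the target rather than on the raw values.
y_bins = pd.qcut(y, q=5, labels=False)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y_bins)
```

With real data, the number of bins has to be small enough that every bin holds at least 2 rows, otherwise the same ValueError comes back.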
