scikit-learn error: The least populated class in y has only 1 member


Problem description

I'm trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I'm getting this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame that holds the data points; y is a pandas DataFrame with a single column containing the target variable.

I cannot post the original data because it is proprietary, but the setup is fairly easy to reproduce: create a random pandas DataFrame (X) with 1k rows x 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) that holds, for each row, the target variable (a categorical label). The y DataFrame should contain several distinct categorical labels (e.g. 'class1', 'class2'...), and each label should occur at least 15 times.
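
A minimal sketch of the synthetic setup described above, assuming NumPy and pandas are installed. The label names, the seed, and the column name `target` are made up for illustration. Whether this data triggers the same error may depend on the scikit-learn version, so this only shows the shapes involved.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

n_rows, n_cols = 1000, 500
X = pd.DataFrame(rng.normal(size=(n_rows, n_cols)))

# Six hypothetical labels, cycled to fill 1000 rows and then shuffled,
# so every label occurs far more than 15 times.
labels = np.array([f"class{i}" for i in range(6)])
y = pd.DataFrame({"target": rng.permutation(np.resize(labels, n_rows))})

print(X.shape)  # 2-D feature table
print(y.shape)  # still 2-D: a one-column DataFrame, not a 1-D Series
print(y.iloc[:, 0].value_counts())
```

Note that `y.shape` has two dimensions here, while `y.iloc[:, 0]` is a one-dimensional Series.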

Recommended answer

The problem was that train_test_split takes two arrays as input, but y was a one-column matrix (a DataFrame) rather than a 1-D array. If I pass only the first column of y, it works.

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
  random_state=85, stratify=y.iloc[:,0])
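
The fix can be tried on made-up data. This sketch assumes a small random X and a one-column y DataFrame (the sizes, class names, column name, and seed are illustrative, not the original data). It passes the target as a 1-D Series for both the target and `stratify`, then prints the per-class counts in each split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # arbitrary seed

# Illustrative data: 150 rows, 3 hypothetical classes with 50 rows each.
X = pd.DataFrame(rng.normal(size=(150, 4)))
y = pd.DataFrame({"target": np.repeat(["M0", "M1", "M2"], 50)})

# Take the single column as a 1-D Series instead of passing the DataFrame.
target = y.iloc[:, 0]

xtrain, xtest, ytrain, ytest = train_test_split(
    X, target, test_size=1/3, random_state=85, stratify=target
)

# With stratification, each class should appear in both splits
# in roughly the same proportion as in the full data.
print(ytrain.value_counts().sort_index())
print(ytest.value_counts().sort_index())
```

Selecting the column by name (`y["target"]` in this sketch) should work equally well, since it also yields a Series.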
