scikit-learn error: The least populated class in y has only 1 member

Question

I'm trying to split my dataset into a training set and a test set using the train_test_split function from scikit-learn, but I'm getting this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame that represents the data points; y is a pandas DataFrame with one column that contains the target variable.

I cannot post the original data because it is proprietary, but the problem should be fairly reproducible: create a random pandas DataFrame (X) with 1k rows x 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) as X, holding the target variable (a categorical label) for each row. The y DataFrame should contain several distinct categorical labels (e.g. 'class1', 'class2', ...), and each label should occur at least 15 times.

Recommended answer

The problem was that train_test_split takes two arrays as input, but y was a one-column matrix (a DataFrame) rather than a one-dimensional array of labels. If I pass only the first column of y, it works:

xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,0], test_size=1/3,
  random_state=85, stratify=y.iloc[:,0])
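
For anyone who wants to try the fix without the proprietary data, here is a minimal sketch on synthetic data. The class counts are copied from the value_counts() output in the question; the feature matrix, its 5 columns, the column name "target", and the variable y_1d are assumptions made for illustration only. Whether passing the one-column DataFrame directly still raises the error may depend on the installed scikit-learn version, so the sketch only demonstrates the 1-D approach from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the proprietary data (sizes and names are assumptions).
rng = np.random.default_rng(0)
counts = {"M0": 15, "M1": 35, "M2": 38, "M3": 15, "M4": 29, "M5": 15}
labels = np.repeat(list(counts), list(counts.values()))

X = pd.DataFrame(rng.normal(size=(len(labels), 5)))
y = pd.DataFrame({"target": labels})  # one-column DataFrame, as in the question

# Select the single column to obtain a 1-D Series of labels.
y_1d = y.iloc[:, 0]

# Pass the 1-D labels both as the array to split and as the stratify argument.
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y_1d, test_size=1/3, random_state=85, stratify=y_1d
)

# Every class should appear in both parts, in roughly the original proportions.
print(ytrain.value_counts().sort_index())
print(ytest.value_counts().sort_index())
```

Selecting the column by name (y["target"] in this sketch) would give the same 1-D Series as y.iloc[:, 0].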
