ValueError: n_splits=10 cannot be greater than the number of members in each class


Question

I am trying to run the following code:

from sklearn.model_selection import StratifiedKFold 
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]

skf = StratifiedKFold(n_splits=10)

for train, test in skf.split(X,y):  
    print("%s %s" % (train,test))

But I get the following error:

ValueError: n_splits=10 cannot be greater than the number of members in each class.

I have looked at the question "scikit-learn error: The least populated class in y has only 1 member", but I'm still not sure what is wrong with my code.

Both of my lists have length 14, as confirmed by print(len(X)) and print(len(y)).

Part of my confusion is that I am not sure what a member is and what a class is in this context.

Questions: How do I fix the error? What is a member? What is a class (in this context)?

Recommended answer

Stratification means keeping the ratio of the classes the same in each fold. So if your original dataset has 3 classes in the ratio 60%, 20% and 20%, stratification will try to keep that ratio in each fold.

In your case,

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today", 
     "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

Each distinct label in y is a class, and each sample is a member of the class its label names. You have a total of 14 samples (members) with the following distribution:

class    number of members    percentage (%)
 'n'            9                  64
 'r'            3                  21
 'y'            2                  14
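
The distribution can be reproduced with a few lines of standard-library Python. This is only a sketch for checking the counts; the names counts and total are illustrative and not part of the original code.

```python
from collections import Counter

y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# Count how many members (samples) each class (distinct label) has.
counts = Counter(y)
total = len(y)

# Print the classes from most to least populated, with rounded percentages.
for label, count in counts.most_common():
    print(f"{label!r}: {count} members, {100 * count / total:.0f}%")
```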

So StratifiedKFold will try to keep that ratio in each fold. You have specified 10 folds (n_splits=10), so to maintain the ratio, a single fold would need 2 / 10 = 0.2 members of class 'y'. A fold cannot contain less than one whole member (sample), which is why the error is raised.
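
The arithmetic can be spelled out for every class. The helper members_per_fold below is a hypothetical name used only for illustration; it simply divides each class count by the number of folds and is not how scikit-learn assigns samples internally.

```python
from collections import Counter

y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]
counts = Counter(y)


def members_per_fold(counts, n_splits):
    """Average number of members of each class that a single fold would get."""
    return {label: count / n_splits for label, count in counts.items()}


# With 10 folds, every class averages less than one member per fold.
per_fold = members_per_fold(counts, n_splits=10)
print(per_fold)
```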

If you had set n_splits=2 instead of n_splits=10, it would have worked, because each fold would then get 2 / 2 = 1 member of class 'y'. For n_splits=10 to work correctly, you need at least 10 samples in each of your classes.
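
One possible way to apply this, sketched under the assumption that you want every fold to contain at least one member of every class, is to cap n_splits at the size of the smallest class. The variable max_splits is an illustrative name, not a scikit-learn parameter.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today",
     "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# The smallest class limits how many folds can each receive one of its members.
max_splits = min(Counter(y).values())  # 2 for this data

skf = StratifiedKFold(n_splits=max_splits)
folds = list(skf.split(X, y))

for train, test in folds:
    # Show the test indices together with the labels they select.
    print("train:", train, "test:", test, "test labels:", [y[i] for i in test])
```

If more folds are needed, the alternative is to collect more samples for the under-represented classes.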
