ValueError: n_splits=10 cannot be greater than the number of members in each class

Question

I am trying to run the following code:

from sklearn.model_selection import StratifiedKFold

# 14 text samples and their class labels
X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "r", "n", "n", "n", "r"]

skf = StratifiedKFold(n_splits=10)

# Print the train and test indices of each fold
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

But I get the following error:

ValueError: n_splits=10 cannot be greater than the number of members in each class.

I have looked at "scikit-learn error: The least populated class in y has only 1 member", but I'm still not really sure what is wrong with my code.

Both of my lists have length 14, as print(len(X)) and print(len(y)) confirm.

Part of my confusion is that I am not sure how a member is defined or what a class is in this context.

Questions: How do I fix the error? What is a member? What is a class? (in this context)

Recommended answer

Stratification means preserving the ratio of each class in every fold. So if your original dataset has 3 classes in the ratio 60%, 20% and 20%, stratification will try to keep that ratio in each fold.

In your case,

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today", 
     "no","join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

You have a total of 14 samples (members) with the following distribution:

class    number of members    percentage (%)
'n'      9                    64
'r'      3                    21
'y'      2                    14
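
The distribution above can be reproduced with a few lines of standard-library Python. This is only a sketch for checking the counts; the variable names counts and total are illustrative and not part of the original answer.

```python
from collections import Counter

# Labels as used in this answer (note the third class 'y')
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

counts = Counter(y)   # members per class
total = len(y)        # total number of samples

# Print each class with its member count and rounded share of the data
for label, count in counts.most_common():
    print(f"{label!r}: {count} members, {100 * count / total:.0f}%")
```

Here a "class" is one distinct label value in y, and a "member" is one sample carrying that label.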

So StratifiedKFold will try to keep that ratio in each fold. You have specified 10 folds (n_splits), so to maintain the ratio each fold would need 2 / 10 = 0.2 members of class 'y'. A fold cannot hold less than 1 member (sample), which is why it throws the error.
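
The same arithmetic can be written out for every class. The sketch below only divides the class counts by n_splits; it does not call scikit-learn, and the names per_fold and too_small are made up for illustration.

```python
from collections import Counter

y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]
n_splits = 10

# Average number of members of each class that one fold would receive
per_fold = {label: count / n_splits for label, count in Counter(y).items()}

# Classes that cannot contribute at least one member to every fold
too_small = [label for label, share in per_fold.items() if share < 1]

print(per_fold)
print(too_small)
```

With 10 folds, every class in this data falls below one member per fold. As far as I know, in recent scikit-learn versions the ValueError is raised only in that situation (n_splits larger than the size of every class); if only some classes are too small, a warning about the least populated class is issued instead.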

If you had set n_splits=2 instead of n_splits=10, it would have worked, because then each fold gets 2 / 2 = 1 member of class 'y'. For n_splits=10 to work correctly, you need at least 10 samples in each of your classes.
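
One possible way to apply this is to derive n_splits from the smallest class instead of hard-coding it. The sketch below assumes scikit-learn is installed; max_splits, raised and fold_labels are illustrative names, not part of any library.

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold

X = ["hey", "join now", "hello", "join today", "join us now", "not today",
     "join this trial", " hey hey", " no", "hola", "bye", "join today",
     "no", "join join"]
y = ["n", "r", "n", "r", "r", "n", "n", "n", "n", "y", "n", "n", "n", "y"]

# Largest n_splits for which every class can appear in every fold
max_splits = min(Counter(y).values())

# n_splits=10 exceeds the size of every class, so split() raises ValueError
try:
    list(StratifiedKFold(n_splits=10).split(X, y))
    raised = False
except ValueError as err:
    raised = True
    print("n_splits=10 failed:", err)

# With n_splits limited to the smallest class size, the split succeeds
skf = StratifiedKFold(n_splits=max_splits)
fold_labels = []
for train_idx, test_idx in skf.split(X, y):
    labels = Counter(y[i] for i in test_idx)  # class counts in this test fold
    fold_labels.append(labels)
    print(test_idx, dict(labels))
```

With the labels exactly as posted in the question (nine 'n' and five 'r'), the same calculation gives 5 as the upper limit. If you really need 10 folds, the remaining option is to collect more samples for the small classes.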
