Specific number of test/train size for each class in sklearn


Problem description

Data:

import pandas as pd
data = pd.DataFrame({'classes':[1,1,1,2,2,2,2],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})

My code:

import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X = np.array(data[['b','c']])  
y = np.array(data['classes'])     
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=4)

Question:

train_test_split chooses the test set at random from all classes combined. Is there a way to get the same number of test samples from each class? (For example, two samples from class 1 and two samples from class 2. Note that the classes do not contain the same number of samples.)

Expected result:

y_test
array([1, 2, 2, 1], dtype=int64)

Recommended answer

There is no sklearn function or parameter that does this directly. The stratify argument samples in proportion to class size, which, as you indicated in your comment, is not what you want.
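To make the difference concrete, here is a small sketch of what stratify does. The class sizes (10, 20, 30) and test_size=12 are chosen here only for illustration, and the placeholder feature matrix is an assumption. The test set comes out in the 1:2:3 class ratio rather than with equal counts per class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: three classes with 10, 20 and 30 samples.
y = np.repeat([1, 2, 3], [10, 20, 30])
X = np.arange(60).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class proportions of y in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=12, stratify=y, random_state=0)

# Count how many test samples each class received.
labels, counts = np.unique(y_test, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)  # proportional to class size: {1: 2, 2: 4, 3: 6}
```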

You can write a custom function instead. It is slower than train_test_split, though not unreasonably slow in absolute terms. Note that it is written for pandas objects (a DataFrame X and a Series y), not NumPy arrays.

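def train_test_eq_split(X, y, n_per_class, random_state=None):
    # A local RandomState avoids reseeding NumPy's global generator and,
    # unlike `if random_state:`, also honours random_state=0.
    rng = np.random.RandomState(random_state)
    # Draw n_per_class rows from each class without replacement.
    sampled = X.groupby(y, sort=False).apply(
        lambda frame: frame.sample(n_per_class, random_state=rng))
    # Level 0 of the MultiIndex is the class; level 1 is the original row label.
    mask = sampled.index.get_level_values(1)

    # Sampled rows form the test set; everything else is the training set.
    X_train = X.drop(mask)
    X_test = X.loc[mask]
    y_train = y.drop(mask)
    y_test = y.loc[mask]

    return X_train, X_test, y_train, y_test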

Example:

data = pd.DataFrame({'classes': np.repeat([1, 2, 3], [10, 20, 30]),
                     'b': np.random.randn(60),
                     'c': np.random.randn(60)})
y = data.pop('classes')

X_train, X_test, y_train, y_test = train_test_eq_split(
    data, y, n_per_class=5, random_state=123)

y_test.value_counts()
# 3    5
# 2    5
# 1    5
# Name: classes, dtype: int64

How it works:

  1. Group X by the class labels in y and sample n_per_class rows from each group.
  2. Get the inner index level of the sampled object. This is the index of the test set; its set difference with the original index is the training index.
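
Since the question works with NumPy arrays rather than pandas objects, the same two steps can also be sketched on plain arrays. This is only an illustrative variant, not part of the original answer: the helper name eq_split_indices and its seed parameter are made up here. It returns positional indices, so the same split can be applied to both X and y.

```python
import numpy as np

def eq_split_indices(y, n_per_class, seed=None):
    """Hypothetical helper: return (train_idx, test_idx) positions with
    n_per_class test samples drawn from every class in y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Step 1: sample n_per_class positions from each class, without replacement.
    # rng.choice raises ValueError if a class has fewer than n_per_class rows.
    test_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_per_class, replace=False)
        for label in np.unique(y)
    ])
    # Step 2: the training positions are the set difference with all positions.
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# The data from the question: 3 samples of class 1 and 4 samples of class 2.
X = np.array([[3, 10], [4, 11], [5, 12], [6, 13], [7, 14], [8, 15], [9, 16]])
y = np.array([1, 1, 1, 2, 2, 2, 2])

train_idx, test_idx = eq_split_indices(y, n_per_class=2, seed=0)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

With this data, y_test always holds two samples of class 1 and two of class 2; which rows are picked depends on the seed.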
