sklearn train_test_split on pandas: stratify by multiple columns


Question

I'm a relatively new user of sklearn and have run into some unexpected behavior in train_test_split from sklearn.model_selection. I have a pandas dataframe that I would like to split into a training set and a test set. I would like to stratify my data by at least 2, but ideally 4, columns in my dataframe.

sklearn gave no warnings when I tried to do this, but I later found repeated rows in my final data set. I created a sample test to show this behavior:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

a = np.arange(1000000)  # unique row identifier
b = a % 10              # first stratification column
c = a % 5               # second stratification column
df = pd.DataFrame({'a': a, 'b': b, 'c': c})

It seems to work as expected if I stratify by either column on its own:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

But when I try to stratify by both columns, I get repeated values:

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b', 'c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 640000
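
Comparing `len(...)` with `len(set(...))` shows that repeats exist but not which rows are repeated. A minimal sketch of a more direct check with pandas follows; the small `train` frame here is a hypothetical stand-in built by hand, not output from the split above.

```python
import pandas as pd

# Hypothetical miniature of a training set in which row a=3 was drawn twice.
train = pd.DataFrame(
    {"a": [3, 5, 13, 3], "b": ["bar", "bar", "foo", "bar"]},
    index=[3, 5, 13, 3],
)

# True when at least one index label occurs more than once.
has_repeats = not train.index.is_unique

# Every row whose index label is repeated (keep=False marks all copies).
repeated_rows = train[train.index.duplicated(keep=False)]

# Number of distinct identifiers, comparable to len(set(train.a.values)).
n_unique = train["a"].nunique()

print(has_repeats)         # True
print(len(repeated_rows))  # 2
print(n_unique)            # 3
```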

Recommended answer

You're getting duplicates because train_test_split() eventually defines strata as the unique set of values of whatever you passed into the stratify argument. Since the strata are defined from two columns, one row of data may belong to more than one stratum, so sampling may choose the same row twice because it thinks it's sampling from different classes.

The train_test_split() function calls StratifiedShuffleSplit, which uses np.unique() on y (which is what you pass in via stratify). From the source code:

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
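
To see what that call does with a two-column input, here is a small sketch using NumPy alone. The array `y` is a made-up stand-in for `df[['b', 'c']].values`. Without an `axis` argument, `np.unique` flattens its input, so the classes are the individual values rather than the (b, c) pairs; passing `axis=0` is shown only for contrast and is not what the quoted snippet does.

```python
import numpy as np

# Hypothetical two-column label array, one row per sample: (b, c).
y = np.array([
    ["foo", "y"],
    ["bar", "z"],
    ["foo", "z"],
    ["bar", "y"],
])

# Default behaviour: the array is flattened, so each distinct value
# from either column becomes its own class.
classes = np.unique(y)
print(classes)  # ['bar' 'foo' 'y' 'z']

# For contrast: unique rows, i.e. distinct (b, c) combinations.
combos = np.unique(y, axis=0)
print(combos.shape)  # (4, 2)
```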

Here's a simplified sample case, a variation on the example you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

The stratification function thinks there are four classes to split on: foo, bar, y, and z. But these classes are essentially nested, meaning y and z both show up within b == foo and within b == bar, so we get duplicates when the splitter tries to sample from each class.

train, test = train_test_split(df, test_size=0.2, random_state=0, 
                               stratify=df[['b', 'c']])
print(len(train.a.values))  # 16
print(len(set(train.a.values)))  # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`, 
#  I'm just illustrating the idea here.

There's a larger design question here: do you want to use nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that's what you're already getting. The former is more complicated, and that's not what train_test_split is set up to do.
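
If the goal is the former, one common workaround (not part of the original answer, so treat it as a sketch) is to collapse the columns into a single one-dimensional key and stratify on that. Each (b, c) combination then forms exactly one stratum, however a given scikit-learn version treats two-dimensional stratify input. The name `strata`, the `"_"` separator, and the deterministic sample data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Deterministic sample data (assumed for illustration only).
N = 600
a = np.arange(N)
b = np.where(a % 2 == 0, "foo", "bar")
c = np.where(a % 3 == 0, "y", "z")
df = pd.DataFrame({"a": a, "b": b, "c": c})

# One label per row that encodes the (b, c) combination, e.g. "foo_y".
strata = df["b"].astype(str) + "_" + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=strata
)

print(len(train), len(test))      # 480 120
print(train["a"].is_unique)       # True
# Share of each (b, c) combination in the training set.
print(strata.loc[train.index].value_counts(normalize=True))
```

Note that every combined stratum needs at least two members, otherwise train_test_split raises a ValueError; with four columns the number of combinations grows quickly, so rare combinations may need to be merged or dropped first.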

You may find a discussion of nested stratified sampling useful.
