How to generate a train-test split based on a group ID?
Problem description
I have the following data:
df = pd.DataFrame({'Group_ID':[1,1,1,2,2,2,3,4,5,5],
'Item_id':[1,2,3,4,5,6,7,8,9,10],
'Target': [0,0,1,0,1,1,0,0,0,1]})
Group_ID Item_id Target
0 1 1 0
1 1 2 0
2 1 3 1
3 2 4 0
4 2 5 1
5 2 6 1
6 3 7 0
7 4 8 0
8 5 9 0
9 5 10 1
I need to split the dataset into a training set and a test set based on "Group_ID", so that 80% of the data goes into the training set and 20% into the test set, with all rows of a given group kept on the same side.
That is, I need my training set to look something like this:
Group_ID Item_id Target
0 1 1 0
1 1 2 0
2 1 3 1
3 2 4 0
4 2 5 1
5 2 6 1
6 3 7 0
7 4 8 0
Test set:
Group_ID Item_id Target
8 5 9 0
9 5 10 1
What would be the simplest way to do this? As far as I know, the standard train_test_split function in sklearn does not support splitting by groups in a way that also lets me specify the size of the split (e.g. 80/20).
Recommended answer
I figured out the answer. This seems to work:
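from sklearn.model_selection import GroupShuffleSplit

# test_size is the fraction of groups (not rows) held out; next() takes the first generated split
train_inds, test_inds = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state=7).split(df, groups=df['Group_ID']))
train = df.iloc[train_inds]
test = df.iloc[test_inds]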
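To check that the split behaves as intended, here is a minimal self-contained sketch that rebuilds the example data, runs GroupShuffleSplit, and confirms that no group ends up on both sides. The variable names (splitter, shared_groups, test_row_share) are illustrative and not part of the original answer, and n_splits=1 is used here only because a single split is consumed.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Example data from the question
df = pd.DataFrame({'Group_ID': [1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
                   'Item_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Target': [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]})

# A float test_size is interpreted as a proportion of groups, not of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=7)
train_inds, test_inds = next(splitter.split(df, groups=df['Group_ID']))
train, test = df.iloc[train_inds], df.iloc[test_inds]

# Groups present on each side; the intersection should be empty
train_groups = set(train['Group_ID'])
test_groups = set(test['Group_ID'])
shared_groups = train_groups & test_groups

# Actual share of rows in the test set, which depends on group sizes
test_row_share = len(test) / len(df)

print("train groups:", sorted(train_groups))
print("test groups:", sorted(test_groups))
print("shared groups:", shared_groups)
print("test row share:", test_row_share)
```

Because the 20% applies to groups, the row-level ratio is only approximate: with five groups of unequal size, the single held-out group may contain anywhere from 10% to 30% of the rows, depending on which group is drawn.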