How to split data into 3 parts, one of which won't be used?


Problem description

I've got a CSV that I want to split into 80% training, 10% dev-test and 10% test. The dev-test set won't be used further.

I've got it set up like:

import sklearn
import csv
with open('Letter.csv') as f:
   reader = csv.reader(f)
   annotated_data = [r for r in reader]

and for splitting:

import random  
random.seed(1234)  
random.shuffle(annotated_data)

But all the splitting I've seen only splits into 2 sets, and I can't see where to specify the proportion for each partition, e.g. that I want 80% for training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.

Also, once I split it, how do I access the sets separately? For example, I can read all the records as a whole and count the number of entries, but once I split them I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.

Solution

No, scikit-learn's train_test_split cannot split into three sets in a single call. The typical approach is to split twice: first 80/20, then split the 20 percent 50/50. You want to check the train_test_split function.

Essentially, the code with data X and y could look like this:

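import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each, and one label per sample
X, y = np.arange(200).reshape((100, 2)), range(100)

# First split: 80% train, 20% held out
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2)
# Second split: halve the held-out 20% into dev and test (10% each)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)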

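Now you would work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test). Each of these is an ordinary array or list, so len(X_train), len(X_dev) and len(X_test) give the number of records in each set.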
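Since the question already shuffles the rows with random.shuffle, the same 80/10/10 split can also be sketched with plain list slicing instead of train_test_split. This is an alternative and not part of the answer above; the helper name split_80_10_10 and the stand-in rows are hypothetical, used only for illustration in place of the contents of Letter.csv.

```python
import random


def split_80_10_10(rows, seed=1234):
    """Shuffle a copy of rows and cut it into 80% / 10% / 10% slices."""
    rows = list(rows)                   # copy, so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # seeded shuffle for reproducibility
    n_train = len(rows) * 8 // 10       # index where the training slice ends
    n_dev = len(rows) * 9 // 10         # index where the dev-test slice ends
    return rows[:n_train], rows[n_train:n_dev], rows[n_dev:]


# Hypothetical stand-in for the rows read from Letter.csv
annotated_data = [[f"letter {i}", str(i % 2)] for i in range(100)]

train, dev_test, test = split_80_10_10(annotated_data)

# Each set is an ordinary list, so len() counts its records
print(len(train), len(dev_test), len(test))
```

The three results are separate lists, so they can be indexed, iterated or counted independently; the dev-test list can simply be left unused.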
