How to split data into 3 parts, one of which won't be used?
I've got a CSV that I want to split 80% into a training set, 10% into a dev-test set and 10% into a test set. The dev-test set won't be used further.
I've got it set up like:
import sklearn
import csv
with open('Letter.csv') as f:
    reader = csv.reader(f)
    annotated_data = [r for r in reader]
and for splitting:
import random
random.seed(1234)
random.shuffle(annotated_data)
But all the splitting I've seen only splits into 2 sets, and I can't see where to specify the proportion for each partition, e.g. I want 80% training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.
Also, once I split it, how do I access the sets separately? For example, I can read each record as a whole and count the number of entries, but once I split it I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.
No, it's not possible in scikit-learn to split into three sets directly.
The typical approach is to split twice: first 80/20, and then split the remaining 20 percent 50/50. You want to check the train_test_split function.
Essentially, the code with data X and y could look like this:
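import numpy as np
from sklearn.model_selection import train_test_split
# Toy data: 100 samples with 2 features each, and 100 labels
X, y = np.arange(200).reshape((100, 2)), range(100)
# First split: 80% train, 20% temporary hold-out
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2)
# Second split: halve the hold-out into dev-test and test (10% each)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)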
Now you would want to work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test).
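The same two-step split should also work on a plain list of rows, such as the annotated_data list from the question, and the size of each set can then be read with len(). The following is a minimal sketch, not a definitive implementation: the rows are made-up stand-ins for the contents of Letter.csv, and random_state=1234 is an assumption chosen to mirror the seed in the question.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rows read from Letter.csv
annotated_data = [[f"text {i}", f"label {i % 3}"] for i in range(100)]

# First split: 80% train, 20% held out (shuffling is done by train_test_split)
train, tmp = train_test_split(annotated_data, test_size=0.2, random_state=1234)

# Second split: halve the held-out 20% into dev-test and test (10% each)
dev, test = train_test_split(tmp, test_size=0.5, random_state=1234)

# Each result is an ordinary list, so len() gives the number of records
print(len(train), len(dev), len(test))  # prints: 80 10 10
```

Each of train, dev and test is a separate list that can be indexed or iterated on its own. Since train_test_split shuffles by default, the manual random.shuffle step from the question is not needed with this approach.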