How to split data into 3 parts, one of which won't be used?

Published: 2021/7/16 20:18:52, tagged scikit-learn


Problem description

I've got a CSV file that I want to split into 80% training, 10% dev-test and 10% test sets. The dev-test set won't be used further.

I've got it set up like:

import sklearn
import csv
with open('Letter.csv') as f:
   reader = csv.reader(f)
   annotated_data = [r for r in reader]

and for splitting:

import random  
random.seed(1234)  
random.shuffle(annotated_data)

But all the splitting I've seen only splits into 2 sets, and I can't see where to specify the proportion for each partition, e.g. I want 80% training. Maybe I'm blind, but can anyone help me? I don't know how to use pandas.

Also, once I've split it, how do I access the sets separately? For example, I can read the records as a whole and count the number of entries, but once I've split the data I want to count how many records are in each set. Sorry if this deserves its own post, but I don't want to spam.

Solution

No, it's not possible in scikit-learn to split into three sets directly. The typical approach is to split twice: first 80/20, and then split the 20 percent 50/50. You want to look at the train_test_split function.

Essentially, the code with data X and y could look like this:

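import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each, plus one label per sample.
X, y = np.arange(200).reshape((100, 2)), range(100)

# First split: 80% train, 20% held out.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2)
# Second split: divide the held-out 20% in half, giving 10% dev and 10% test.
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)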

Now you can work with (X_train, y_train), (X_dev, y_dev) and (X_test, y_test) as three separate sets.
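For the list of CSV rows from the question, the same two-step split could be applied to a single list, and the size of each set checked with len(). This is only a rough sketch: the rows below are made-up stand-ins for the contents of Letter.csv, and the names train, dev_test and test are chosen here for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rows read from Letter.csv with csv.reader.
annotated_data = [[f"text {i}", f"label {i % 3}"] for i in range(100)]

# First split: 80% training, 20% held out. random_state fixes the shuffle.
train, rest = train_test_split(annotated_data, test_size=0.2, random_state=1234)
# Second split: halve the held-out part into dev-test and test (10% each).
dev_test, test = train_test_split(rest, test_size=0.5, random_state=1234)

# Each result is a plain list, so len() gives the number of records per set.
print(len(train), len(dev_test), len(test))
```

Because train_test_split shuffles by default, the separate random.shuffle call should not be needed in this variant, and the unused dev_test list can simply be ignored afterwards.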
