How to split documents into training set and test set?


Question

I am trying to build a classification model. I have 1000 text documents in a local folder. I want to divide them into a training set and a test set with a 70:30 split ratio (70% training, 30% test). What is a good approach to do this? I am using Python.

I want a programmatic way to split out the training set and test set: first, read the files in the local directory; second, build a list of those files and shuffle it; third, split them into a training set and a test set.

I tried a few approaches using built-in Python keywords and functions, only to fail. Eventually I figured out how to approach it. Cross-validation is also a good option to consider for building general classification models.

Answer

Not sure exactly what you're after, so I'll try to be comprehensive. There will be a few steps:

  1. Get a list of the files
  2. Randomize the files
  3. Split the files into training and testing sets
  4. Do the thing


1. Get a list of the files

Let's assume that your files all have the extension .data and they're all in the folder /ml/data/. What we want to do is get a list of all of these files. This is done simply with the os module. I'm assuming you have no subdirectories; this would change if there were.

import os

def get_file_list_from_dir(datadir):
    # Return full paths so the files can be opened later, regardless
    # of the current working directory.
    abs_dir = os.path.abspath(datadir)
    all_files = os.listdir(abs_dir)
    return [os.path.join(abs_dir, f) for f in all_files if f.endswith('.data')]

So if we were to call get_file_list_from_dir('/ml/data'), we would get back a list of all the .data files in that directory (equivalent in the shell to the glob /ml/data/*.data).
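
As a side note, the standard library's glob module can produce the same list in one call; this is just an alternative sketch, not what the answer above uses:

import glob
import os

def get_file_list_with_glob(datadir):
    # glob.glob expands the pattern and returns matching paths,
    # directory prefix included.
    return glob.glob(os.path.join(datadir, '*.data'))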

2. Randomize the files

We don't want the sampling to be predictable, as that is considered a poor way to train an ML classifier.

from random import shuffle

def randomize_files(file_list):
    # random.shuffle rearranges the list in place; nothing is returned.
    shuffle(file_list)

Note that random.shuffle performs an in-place shuffle, so it modifies the existing list. (Of course this function is rather silly, since you could just call shuffle directly instead of randomize_files; wrapping it in its own function can make the intent clearer, though.)
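
One aside, not in the original answer: if you want the same "random" split on every run (say, to debug your pipeline), you can shuffle with a seeded generator instead. A minimal sketch:

from random import Random

def randomize_files_seeded(file_list, seed=42):
    # A dedicated Random instance keeps the global generator untouched.
    Random(seed).shuffle(file_list)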

3. Split the files into training and testing sets

I'll assume a 70:30 ratio instead of any specific number of documents. So:

from math import floor

def get_training_and_testing_sets(file_list):
    split = 0.7
    # floor gives an integer index even when the length times the ratio
    # isn't a whole number.
    split_index = floor(len(file_list) * split)
    training = file_list[:split_index]
    testing = file_list[split_index:]
    return training, testing
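
For instance, with the question's 1000 documents (data_files being the shuffled list from steps 1 and 2):

training, testing = get_training_and_testing_sets(data_files)
# With 1000 files, len(training) == 700 and len(testing) == 300.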

4. Do the thing

This is the step where you open each file and do your training and testing. I'll leave this to you!
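
Tying the three previous steps together, a minimal driver might look like this; do_training and do_testing are hypothetical placeholders for your own model code:

def run_simple_split(datadir):
    data_files = get_file_list_from_dir(datadir)
    randomize_files(data_files)
    training, testing = get_training_and_testing_sets(data_files)
    do_training(training)  # hypothetical: fit your classifier on these files
    do_testing(testing)    # hypothetical: evaluate it on the held-out files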

Out of curiosity, have you considered using cross-validation? This is a method of splitting your data so that you use every document for training and testing. You can customize how many documents are used for training in each "fold". I could go more into depth on this if you like, but I won't if you don't want to do it.

Alright, since you asked, I'll explain this a little more.

So we have a 1000-document set of data. The idea of cross-validation is that you can use all of it for both training and testing, just not at once. We split the dataset into what we call "folds". The number of folds determines the size of the training and testing sets at any given point in time.

Let's say we want a 10-fold cross-validation system. This means that the training and testing algorithms will run ten times. The first fold will train on documents 1-100 and test on 101-1000. The second fold will train on 101-200 and test on 1-100 and 201-1000.

If we did, say, a 40-fold CV system, each fold would hold 25 documents: the first fold would train on documents 1-25 and test on 26-1000; the second fold would train on 26-50 and test on 1-25 and 51-1000; and so on.

To implement such a system, we would still need to do steps (1) and (2) from above, but step (3) would be different. Instead of splitting into just two sets (one for training, one for testing), we could turn the function into a generator: a function we can iterate through like a list.

def cross_validate(data_files, folds):
    # This simple scheme needs the fold size to divide the data evenly.
    if len(data_files) % folds != 0:
        raise ValueError(
            "invalid number of folds ({}) for the number of "
            "documents ({})".format(folds, len(data_files))
        )
    fold_size = len(data_files) // folds
    # Slide a one-fold window across the data: that fold is the training
    # set, and everything outside the window is the testing set.
    for split_index in range(0, len(data_files), fold_size):
        training = data_files[split_index:split_index + fold_size]
        testing = data_files[:split_index] + data_files[split_index + fold_size:]
        yield training, testing

That yield keyword at the end is what makes this a generator. To use it, you would write something like this:

def ml_function(datadir, num_folds):
    data_files = get_file_list_from_dir(datadir)
    randomize_files(data_files)
    # Each pass through the loop gets a different training/testing split.
    for train_set, test_set in cross_validate(data_files, num_folds):
        do_ml_training(train_set)
        do_ml_testing(test_set)
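
For example, to run 10-fold cross-validation over the folder assumed back in step 1:

ml_function('/ml/data', 10)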

Again, it's up to you to implement the actual functionality of your ML system.
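
If you'd rather lean on a library, scikit-learn's KFold implements the conventional variant of cross-validation (train on k-1 folds and test on the remaining one, which is the reverse of the scheme above). A minimal sketch, assuming scikit-learn is installed:

from sklearn.model_selection import KFold

def sklearn_cross_validate(data_files, folds):
    # KFold yields index arrays; map them back to file paths.
    kf = KFold(n_splits=folds, shuffle=True)
    for train_idx, test_idx in kf.split(data_files):
        training = [data_files[i] for i in train_idx]
        testing = [data_files[i] for i in test_idx]
        yield training, testing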

As a disclaimer, I'm no expert by any means, haha. But let me know if you have any questions about anything I've written here!
