Split data into training and testing


Problem description

I want to replicate this tutorial to classify two groups (https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/) with a different dataset, but could not manage it despite trying hard. I am new to programming, so I would appreciate any assistance or tips.

My dataset is small (240 files for each group), and the files are named 01 - 0240.

I think the problem is around these lines of code:

    if is_trian and filename.startswith('cv9'):
        continue
    if not is_trian and not filename.startswith('cv9'):
        continue

And these:

    trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
    save_dataset([trainX,trainy], 'train.pkl')

    testY = [0 for _ in range(100)] + [1 for _ in range(100)]
    save_dataset([testX,testY], 'test.pkl')

Two errors were encountered so far:

Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.

Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

I would really appreciate any prompt help.

Thanks.

// Part of the code, for more clarity. //

# load all docs in a directory
def process_docs(directory, is_trian):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set

Just as mentioned in the tutorial, I want to add an argument below to indicate whether to process the training or testing files. Or if there's another way, please share it.

        if is_trian and filename.startswith('----'):
            continue
        if not is_trian and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents
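Since the files here are numbered (01 - 0240) rather than prefixed like the tutorial's 'cv9' reviews, one option is to split on the numeric part of the filename instead of `startswith()`. This is only a sketch: the `is_test_file` helper, the regex, and the cutoff of 200 are illustrative assumptions, not from the original question.

```python
import re

# Hypothetical helper: treat files numbered above a cutoff as the test set.
# Assumes filenames like '0001.txt' ... '0240.txt'; adjust the pattern and
# cutoff to match the real naming scheme.
TEST_CUTOFF = 200  # files 0201-0240 become the test set (~17%)

def is_test_file(filename, cutoff=TEST_CUTOFF):
    # grab the leading digits of the filename, e.g. '0205' from '0205.txt'
    match = re.match(r'(\d+)', filename)
    if match is None:
        return False  # non-numeric names are never treated as test files
    return int(match.group(1)) > cutoff

# inside process_docs, the startswith() checks would then become:
#     if is_trian and is_test_file(filename):
#         continue
#     if not is_trian and not is_test_file(filename):
#         continue
```

This keeps the tutorial's structure (one function, a flag for train vs. test) while adapting the split rule to numbered files.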

# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len(healthy_docs))] + [1 for _ in range(len(sick_docs))]
save_dataset([trainX,trainy], 'train.pkl')

# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len(healthy_docs))] + [1 for _ in range(len(sick_docs))]

save_dataset([testX,testY], 'test.pkl')

Answer

I was able to solve the problem by separating the dataset into train and test sets manually and then labelling each set separately. My current dataset is very small, so I will keep looking for a better solution for large datasets once I have the capacity. Posting this to close the question.
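For larger datasets, a common way to automate this split is scikit-learn's `train_test_split`. A minimal sketch, assuming the documents have already been loaded and labelled the same way as `healthy_docs + sick_docs` above (the stand-in lists here are illustrative):

```python
from sklearn.model_selection import train_test_split

# Stand-ins for the combined document list and its labels (0 = healthy, 1 = sick),
# built the same way as trainX/trainy in the question's code.
docs = ['doc %d' % i for i in range(480)]
labels = [0] * 240 + [1] * 240

# stratify=labels keeps the healthy/sick ratio identical in both splits;
# random_state makes the split reproducible.
trainX, testX, trainy, testY = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=1)
```

With 480 documents and `test_size=0.2`, this yields 384 training and 96 test samples, 48 of each class in the test set, and avoids hand-maintaining two directory trees.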
