Different accuracy when splitting data with train_test_split than when loading a CSV file afterwards

Views: 0 | Posted: 2022/8/4 18:44:00 | Tags: python tensorflow machine-learning keras classification


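Problem description

I have built a model that predicts whether a customer is a business customer or a private customer. After training the model, I predict the class for 1000 records that were not used for training. The prediction is saved to a CSV file. I now see two different behaviors: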

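Splitting the sample data within the program

When I create the sample with train, sample = train_test_split(train, test_size=1000, random_state=seed), the prediction reaches the same accuracy as during training (the same value as on the validation data).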

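Splitting the sample data beforehand, then loading it

However, when I split the data manually before training, by taking 1000 records out of the original CSV file and copying them into a new sample CSV file that I load after training and before prediction, I get much worse results (e.g. 76% instead of 90%). This behavior makes no sense to me: the original data (the CSV file used for training) is also shuffled, so I should get the same results. Here is the relevant code for the two cases above: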

1. Splitting the sample data within the program

Splitting

from sklearn.model_selection import train_test_split

def getPreProcessedDatasetsWithSamples(filepath, batch_size):
    # `seed` is not defined in the snippet as posted; it is assumed to be a module-level constant.
    data = __getPreprocessedDataFromPath(filepath)

    # 80/20 train/test split, then 80/20 train/validation split.
    train, test = train_test_split(data, test_size=0.2, random_state=42)
    train, val = train_test_split(train, test_size=0.2, random_state=42)
    # Hold out exactly 1000 rows of the remaining training data as the sample.
    train, sample = train_test_split(train, test_size=1000, random_state=seed)

    train_ds = __df_to_dataset(train, shuffle=False, batch_size=batch_size)
    val_ds = __df_to_dataset(val, shuffle=False, batch_size=batch_size)
    test_ds = __df_to_dataset(test, shuffle=False, batch_size=batch_size)
    sample_ds = __df_to_dataset(sample, shuffle=False, batch_size=batch_size)

    return (train_ds, val_ds, test_ds, sample, sample_ds)

Predicting with sample and sample_ds

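def savePredictionWithSampleToFileKeras(model, outputName, sample, sample_ds):
    # Predict on the held-out sample and evaluate it with the compiled loss and metric.
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)

    print("Accuracy of sample", accuracy)

    # Store the predictions next to the original rows and write them to disk.
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")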

Sample accuracy: 90%

2. Splitting the sample data beforehand, then loading it

Predicting by loading the CSV file

def savePredictionToFileKeras(model, sampleFilePath, outputName, batch_size):
    # The sample file is read and preprocessed twice: once as a dataset, once as a DataFrame.
    sample_ds = preprocessing.getPreProcessedSampleDataSets(sampleFilePath, batch_size)
    sample = preprocessing.getPreProcessedSampleDataFrames(sampleFilePath)

    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)

    print("Accuracy of sample", accuracy)

    # Store the predictions next to the loaded rows and write them to disk.
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")

Sample accuracy: 77%
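For reference, the manual hold-out described in case 2 can be sketched with the standard library alone. This is only an illustration of the procedure, not the code used in the question: the function name split_csv_rows, the column names, and the toy data are hypothetical.

```python
import csv
import io
import random


def split_csv_rows(csv_text, n_sample, seed):
    """Split CSV text into (training_text, sample_text) sharing one header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Pick the held-out row positions reproducibly.
    picked = set(random.Random(seed).sample(range(len(body)), n_sample))
    train_rows = [row for i, row in enumerate(body) if i not in picked]
    sample_rows = [row for i, row in enumerate(body) if i in picked]

    def to_text(selected):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(selected)
        return buffer.getvalue()

    return to_text(train_rows), to_text(sample_rows)


# Hypothetical toy data standing in for the customer CSV.
toy_csv = "id,segment,label\n" + "".join(
    f"{i},{'B' if i % 3 == 0 else 'P'},{i % 2}\n" for i in range(20)
)
train_text, sample_text = split_csv_rows(toy_csv, n_sample=5, seed=42)
train_ids = {line.split(",")[0] for line in train_text.splitlines()[1:]}
sample_ids = {line.split(",")[0] for line in sample_text.splitlines()[1:]}
```

Checking that train_ids and sample_ids are disjoint confirms that no held-out row leaks into training, which is the property both cases rely on.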

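Edit

Observation: when I load the entire data set as the sample data, I get the same value as on validation (about 90%), but when I merely randomize the row order of the same file, I get 82%. As I understand it, the accuracy should be identical because the files contain the same data.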
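The expectation behind this observation can be checked in isolation. Accuracy is an average over rows, so reordering the rows together with their predictions cannot change it, provided the model scores each row independently. The sketch below uses made-up labels and predictions (not values from the question) to show this.

```python
import random


def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(1 for y, p in zip(labels, predictions) if y == p)
    return matches / len(labels)


# Hypothetical labels and predicted classes, not taken from the question.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
predictions = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

order = list(range(len(labels)))
random.Random(0).shuffle(order)

# Rows and their predictions are permuted together, so pairs stay intact.
shuffled_labels = [labels[i] for i in order]
shuffled_predictions = [predictions[i] for i in order]

original_accuracy = accuracy(labels, predictions)
shuffled_accuracy = accuracy(shuffled_labels, shuffled_predictions)
```

If the two values differ in the real pipeline, that suggests the per-row inputs or the pairing of rows with labels changed after shuffling, rather than anything in the model itself.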

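Some additional information: I have changed the implementation from the Sequential API to the Functional API. I use embeddings in the preprocessing (I also tried one-hot encoding, without success).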
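Because embeddings consume integer ids, one thing that may be worth ruling out is whether the value-to-id mapping is rebuilt from each loaded file. This is purely an assumption: the bodies of __getPreprocessedDataFromPath and getPreProcessedSampleDataSets are not shown in the question. The sketch below uses a hypothetical build_vocabulary helper and made-up category values to show how a mapping built in order of first appearance depends on row order.

```python
def build_vocabulary(values):
    """Map each distinct value to an integer in order of first appearance."""
    vocabulary = {}
    for value in values:
        if value not in vocabulary:
            vocabulary[value] = len(vocabulary)
    return vocabulary


# Hypothetical categorical column; the real preprocessing is not shown.
training_column = ["retail", "wholesale", "retail", "online", "wholesale"]
shuffled_column = ["online", "retail", "wholesale", "retail", "wholesale"]

training_vocabulary = build_vocabulary(training_column)
shuffled_vocabulary = build_vocabulary(shuffled_column)

# Same set of values, but different integer ids once the row order changes.
same_values = set(training_vocabulary) == set(shuffled_vocabulary)
same_mapping = training_vocabulary == shuffled_vocabulary

# Reusing the training vocabulary keeps ids stable for any row order.
stable_ids = [training_vocabulary[value] for value in shuffled_column]
```

If the real preprocessing behaves like this, saving the mapping built at training time and reusing it for every later file would keep the ids consistent. Whether that applies here cannot be confirmed from the code in the question.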
