How to find the optimal values for splitting the data into test and train?

Question

I am building a Python application in which I want to forecast PM2.5 values over a month. I am using polynomial regression, and I split the data into 30% test data and 70% training data before training. This is the line of code I use for the split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True)

However, I have noticed that if I pass different integers as random_state, the mean squared error changes, and so does the accuracy of the forecast. How can I find the optimal parameters for train_test_split so that the forecast is as accurate as possible?

Recommended answer

How much does the accuracy vary when you change the random seed?
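One way to answer that question is to repeat the split with several seeds and look at the spread of the test error. The sketch below is only an illustration: the synthetic X and y, the polynomial degree of 2, the helper name mse_for_seed, and the choice of 20 seeds are all assumptions, not details from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption; replace with the real X and y).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=2.0, size=200)


def mse_for_seed(seed, degree=2):
    """Split with the given seed, fit polynomial regression, return test MSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, shuffle=True
    )
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


# Test MSE for 20 different seeds; the spread shows how much the split matters.
seed_mses = np.array([mse_for_seed(seed) for seed in range(20)])
print(f"MSE over 20 seeds: mean={seed_mses.mean():.3f}, std={seed_mses.std():.3f}")
```

If the standard deviation is small relative to the mean, the seed has little influence. A large spread suggests the test set is too small or the data too heterogeneous for a single split to give a stable estimate.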

You can use k-fold cross-validation to find the best split. However, I am not sure you actually want the split with the highest accuracy: you want your model to generalize. Choose a split that leaves enough training data and gives a test set that is representative of the real-world data the model will encounter.
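A minimal sketch of k-fold cross-validation with scikit-learn follows. It does not pick a single "best" split; it averages the error over all folds, which is in line with the caution above. The synthetic data, the degree of 2, and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption; replace with the real X and y).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(scale=2.0, size=200)

# Polynomial regression as a single estimator, refit from scratch on each fold.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# 5 folds: every sample is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that higher is better; flip the sign back.
neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print("MSE per fold:", np.round(fold_mse, 3))
print(f"Mean MSE: {fold_mse.mean():.3f} (std {fold_mse.std():.3f})")
```

The mean across folds is a steadier estimate of generalization error than the score from any one random_state, so it is a better basis for comparing model settings such as the polynomial degree.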
