Machine Learning Training & Test data split method

Problem Description

I was running a random forest classification model and initially divided the data into train (80%) and test (20%). However, the predictions had too many false positives, which I think was because there was too much noise in the training data, so I decided to split the data differently. Here's how I did it.

Since I thought the high false-positive rate was due to the noise in the training data, I made the training data have an equal number of each target class. For example, if I have 10,000 rows of data and the target variable is 8,000 (0) and 2,000 (1), I made the training data a total of 4,000 rows, including 2,000 (0) and 2,000 (1), so that the training data now has a stronger signal.
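The balanced split described above can be sketched as a downsampling of the majority class. This is a minimal illustration, assuming a pandas DataFrame with a 0/1 column named `target` (the column name and function name are illustrative, not from the original post):

```python
import pandas as pd

def balanced_train_split(df, target_col="target", seed=42):
    """Downsample the majority class (0) so both classes are equally represented."""
    minority = df[df[target_col] == 1]
    majority = df[df[target_col] == 0]
    # Keep only as many majority rows as there are minority rows.
    majority_down = majority.sample(n=len(minority), random_state=seed)
    # Concatenate and shuffle the balanced result.
    return pd.concat([minority, majority_down]).sample(frac=1, random_state=seed)

# Example: 10,000 rows with 8,000 zeros and 2,000 ones -> 4,000 balanced rows.
df = pd.DataFrame({"target": [0] * 8000 + [1] * 2000})
train = balanced_train_split(df)
print(len(train), train["target"].mean())  # 4000 rows, mean 0.5
```

Note that the test set should still reflect the original class distribution, so that evaluation metrics match what the model will see in production.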

When I tried this new splitting method, it predicted much better, increasing the recall for the positive class from 14% to 70%.

I would love to hear your feedback on whether I am doing anything wrong here. I am concerned that I may be making my training data biased.

Recommended Answer

When you have an unequal number of data points in each class in the training set, the baseline (random prediction) changes.

By "noisy data," I think you mean that the number of training points for one class is much larger than for the other. This is not really called noise; it is actually bias.

For example: you have 10,000 data points in the training set, 8,000 of class 0 and 2,000 of class 1. I can predict class 0 all the time and already get 80% accuracy. This induces a bias, and the baseline for 0-1 classification will not be 50%.
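A quick numeric check of the baseline argument: with 8,000 class-0 and 2,000 class-1 points, a classifier that always predicts the majority class is already right 80% of the time without learning anything.

```python
# Majority-class baseline on an 80/20 imbalanced label set.
labels = [0] * 8000 + [1] * 2000
majority_accuracy = sum(1 for y in labels if y == 0) / len(labels)
print(majority_accuracy)  # 0.8
```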

To remove this bias, you can either intentionally balance the training set as you did, or you can change the error function by giving each class a weight inversely proportional to its number of points in the training set.
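The second option, inverse-frequency weighting, can be sketched as follows. Each class gets weight `n_samples / (n_classes * n_samples_in_class)`, which is the same formula scikit-learn applies when you pass `class_weight="balanced"` to a classifier such as `RandomForestClassifier`:

```python
from collections import Counter

# 10,000 labels: 8,000 of class 0, 2,000 of class 1.
labels = [0] * 8000 + [1] * 2000
counts = Counter(labels)
n, k = len(labels), len(counts)

# Weight each class inversely to its frequency.
weights = {c: n / (k * cnt) for c, cnt in counts.items()}
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights, each misclassified minority point costs four times as much as a misclassified majority point, so the model is no longer rewarded for simply predicting the majority class. Unlike downsampling, this keeps all 10,000 training rows.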
