Forecasting future occurrences with Random Forest

Question

I'm currently exploring Random Forests for predicting future occurrence counts (my ARIMA model gave really poor forecasts, so I'm evaluating other options). I'm fully aware that the poor results may be because I don't have much data and its quality isn't great. My initial data consisted simply of the number of occurrences per date. I then added separate columns for the day, month, year and day of the week (the latter one-hot encoded), plus two columns with lagged values: one with the count observed the day before and another with the count observed two days before. The final data looks like this:

Count  Year    Month  Day   Count-1  Count-2  Friday  Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday
196.0  2017.0  7.0    10.0  196.0    196.0    0       1       0         0       0         0        0
264.0  2017.0  7.0    11.0  196.0    196.0    0       0       0         0       0         1        0
274.0  2017.0  7.0    12.0  264.0    196.0    0       0       0         0       0         0        1
286.0  2017.0  7.0    13.0  274.0    264.0    0       0       0         0       1         0        0
502.0  2017.0  7.0    14.0  286.0    274.0    1       0       0         0       0         0        0
...
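The feature construction described above could be sketched with pandas roughly as follows. This is only an illustration, not the asker's actual code: the helper name `build_features` is hypothetical, the input is assumed to be a daily count series with a `DatetimeIndex`, and rows whose lags are missing are simply dropped (the question does not say how the lags of the earliest rows were filled).

```python
import pandas as pd

# Weekday columns in the same (alphabetical) order as the table above
WEEKDAYS = ["Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"]

def build_features(counts: pd.Series) -> pd.DataFrame:
    """Build calendar, lag and one-hot weekday columns from daily counts.

    Assumes `counts` has a DatetimeIndex with one row per day.
    """
    df = pd.DataFrame({"Count": counts.astype(float)})
    df["Year"] = df.index.year
    df["Month"] = df.index.month
    df["Day"] = df.index.day
    # Lagged values: the count observed one and two days earlier
    df["Count-1"] = df["Count"].shift(1)
    df["Count-2"] = df["Count"].shift(2)
    # One-hot encode the day of the week
    for name in WEEKDAYS:
        df[name] = (df.index.day_name() == name).astype(int)
    # The first two rows have no complete lag history, so drop them
    return df.dropna()

# The five counts shown in the question, 2017-07-10 to 2017-07-14
sample = pd.Series(
    [196.0, 264.0, 274.0, 286.0, 502.0],
    index=pd.date_range("2017-07-10", periods=5, freq="D"),
)
print(build_features(sample))
```

Dropping the incomplete rows costs two observations but avoids inventing lag values for days that were never observed.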

I then trained a random forest with the count as the label (what I'm trying to predict) and all the remaining columns as features. I made a 70/30 train/test split, trained the model on the training data and evaluated it on the test set (code below):

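from sklearn.ensemble import RandomForestRegressor

# Fit 1000 trees on the training split; random_state makes the run reproducible
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_features, train_labels)

# Predict the held-out test rows
predictions = rf.predict(test_features)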

The results I obtained were pretty good: an MAE of 1.71 and an accuracy of 89.84%.

First question: is there any possibility that I'm wildly overfitting the data? I just want to make sure I'm not making some big mistake that gives me better results than I should be getting.

Second question: with the model trained, how do I use the RF to predict future values? My goal is to give weekly forecasts of the number of occurrences, but I'm stuck on how to do that.

If someone more experienced than me with this could help, I'd very much appreciate it! Thanks

Recommended answer

Addressing your first question: a random forest can overfit, and the way to check is to compare the MAE, MSE and RMSE on your test set with the same metrics on the training set. What do you mean by accuracy? Your R-squared? That said, a common way to work with models is to let them overfit at first, so that you get a decent accuracy/MSE/RMSE, and then regularize to rein in the overfitting, for example by setting a high min_samples_leaf (the scikit-learn counterpart of XGBoost's min_child_weight, which RandomForestRegressor does not accept) or a low max_depth. A high n_estimators also helps.
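As a rough illustration of that check, the sketch below compares training and test MAE for an unconstrained forest and a regularized one. Everything here is an assumption for demonstration purposes: the data is synthetic, the helper `train_test_mae` is hypothetical, and the split keeps the earliest 70% of rows for training (a chronological split, which may differ from how the asker split the data).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def train_test_mae(features, labels, train_fraction=0.7, **rf_params):
    """Fit on the earliest rows, evaluate on the rest.

    Returns (train_mae, test_mae); a large gap between the two
    is a sign of overfitting.
    """
    split = int(len(features) * train_fraction)
    X_train, X_test = features[:split], features[split:]
    y_train, y_test = labels[:split], labels[split:]
    rf = RandomForestRegressor(random_state=42, **rf_params)
    rf.fit(X_train, y_train)
    return (mean_absolute_error(y_train, rf.predict(X_train)),
            mean_absolute_error(y_test, rf.predict(X_test)))

# Synthetic stand-in data, NOT the asker's dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

unconstrained = train_test_mae(X, y, n_estimators=100)
regularized = train_test_mae(X, y, n_estimators=100,
                             max_depth=3, min_samples_leaf=5)
print("unconstrained train/test MAE:", unconstrained)
print("regularized   train/test MAE:", regularized)
```

If the training error is far below the test error, raising `min_samples_leaf` or lowering `max_depth` trades some training fit for a smaller gap. Whether that also lowers the test error depends on the data.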

Secondly, to predict future values you need to use the exact same model you trained, applied to the dataset you want predictions for. The features passed at prediction time must of course match the ones used in training. Also keep in mind that, as time passes, newly observed data becomes very valuable: adding it to your training set and retraining will improve the model.

forecasting = rf.predict(dataset_to_be_forecasted)
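Because the features in the question include Count-1 and Count-2, the rows for future days cannot all be built up front: the lag for day t+2 is the value for day t+1, which is not yet known. One common way around this, which the answer does not spell out, is a recursive forecast that feeds each prediction back in as the next day's lag. The sketch below illustrates that approach on synthetic data. The helpers `make_row` and `forecast_days` are hypothetical names, and the column layout is assumed to match the table in the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

WEEKDAYS = ["Friday", "Monday", "Saturday", "Sunday",
            "Thursday", "Tuesday", "Wednesday"]
FEATURES = ["Year", "Month", "Day", "Count-1", "Count-2"] + WEEKDAYS

def make_row(date, lag1, lag2):
    """Build a one-row feature frame for `date` with the given lags."""
    row = {"Year": date.year, "Month": date.month, "Day": date.day,
           "Count-1": lag1, "Count-2": lag2}
    for name in WEEKDAYS:
        row[name] = int(date.day_name() == name)
    return pd.DataFrame([row], columns=FEATURES)

def forecast_days(model, history, horizon=7):
    """Recursive forecast: each prediction becomes the next day's lag.

    `history` is a daily count Series with a DatetimeIndex (>= 2 rows).
    """
    lag1, lag2 = float(history.iloc[-1]), float(history.iloc[-2])
    date = history.index[-1]
    forecast = {}
    for _ in range(horizon):
        date = date + pd.Timedelta(days=1)
        pred = float(model.predict(make_row(date, lag1, lag2))[0])
        forecast[date] = pred
        lag1, lag2 = pred, lag1  # shift the lags forward by one day
    return pd.Series(forecast)

# Synthetic stand-in history, NOT the asker's data
dates = pd.date_range("2017-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
history = pd.Series(rng.poisson(200, size=120).astype(float), index=dates)

# Training frame: one row per day that has two earlier observations
train = pd.concat(
    [make_row(d, history.iloc[i - 1], history.iloc[i - 2])
     for i, d in enumerate(dates) if i >= 2],
    ignore_index=True,
)
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(train, history.iloc[2:].to_numpy())

weekly = forecast_days(rf, history, horizon=7)
print(weekly)
```

Errors compound in a recursive forecast, since later days depend on earlier predictions, so the accuracy measured on one-step-ahead test rows will likely overstate how well the model does seven days out.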
