Python Dataframe: Different RMSE Score Every Time I Run Random Forest Regressor

Question

I currently run a random forest model using the following code. I set random_state to 100.

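# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split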

X_train_RIA_INST_PWM, X_test_RIA_INST_PWM, y_train_RIA_INST_PWM, y_test_RIA_INST_PWM = train_test_split(X_RIA_INST_PWM, Y_RIA_INST_PWM, test_size=0.3, random_state = 100)



# Random Forest Regressor for RIA_INST_PWM accounts  

import numpy as np
from sklearn.ensemble import RandomForestRegressor

regressor_RIA_INST_PWM = RandomForestRegressor(n_estimators=100, min_samples_split = 10)
regressor_RIA_INST_PWM.fit(X_RIA_INST_PWM, Y_RIA_INST_PWM)

print("R^2 for training set:", regressor_RIA_INST_PWM.score(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM))

print('-' * 50)

print("R^2 for test set:", regressor_RIA_INST_PWM.score(X_test_RIA_INST_PWM, y_test_RIA_INST_PWM))

I then use the following code to calculate the predicted values.

def predict_AUM(df, features, regressor):

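    # Reset index for later merge of predicted target values with Account IDs
    # Note: reset_index() returns a new DataFrame; the result is not assigned,
    # so df keeps its original index here.
    df.reset_index()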

    # Set predictor variables 
    X_Predict = df[features]

    # Clean inputs 
    X_Predict = X_Predict.replace([np.inf, -np.inf], np.nan)
    X_Predict = X_Predict.fillna(0)

    # Predict Current_AUM
    Y_AUM_Snapshot_1yr_Predict = regressor.predict(X_Predict)
    df['PREDICTED_SPAN'] = Y_AUM_Snapshot_1yr_Predict

    return df 

df_EVENT5_20 = predict_AUM(df_EVENT5_19, dfzip_features_AUM_RIA_INST_PWM, regressor_RIA_INST_PWM)

Finally, I calculate the RMSE of my results:

from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(df_EVENT5_20['SPAN_DAYS'], df_EVENT5_20['PREDICTED_SPAN']))
rmse

Each time I run my code, my RMSE changes. It has varied from 7.75 to 16.4. Why is this happening, and how can I get the same RMSE each time I run the code? Additionally, how do I optimize my model for RMSE?

Recommended answer

You only seeded train_test_split, which makes the random allocation of the data to the training and test sets reproducible.
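As a small illustration of what that seed does, the sketch below splits a toy array twice with the same random_state and checks that both calls return the same test rows. The toy arrays X and y are made up for this example and are not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption: stands in for X_RIA_INST_PWM / Y_RIA_INST_PWM)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
_, X_test_1, _, y_test_1 = train_test_split(X, y, test_size=0.3, random_state=100)
_, X_test_2, _, y_test_2 = train_test_split(X, y, test_size=0.3, random_state=100)

# The same rows end up in the test set both times
same_split = np.array_equal(X_test_1, X_test_2) and np.array_equal(y_test_1, y_test_2)
print(same_split)  # True
```

The seed fixes only the split; it has no effect on any randomness used later when the model is fitted.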

As the name suggests, RandomForestRegressor also contains parts that rely on random numbers (e.g., the different subsets of the data or of the features used to train the individual decision trees). If you want reproducible results, you need to seed it as well. To do that, initialize it with random_state, like this:

regressor_RIA_INST_PWM = RandomForestRegressor(
                           n_estimators=100, 
                           min_samples_split = 10, 
                           random_state=100
                         )
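To check the effect, a minimal sketch along the following lines may help. It uses synthetic data from make_regression as a stand-in for the real dataset, and the helper name fit_and_score is hypothetical. It fits the forest twice with random_state=100 and twice without a seed, then compares the test-set RMSE values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the asker's dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

def fit_and_score(seed):
    """Hypothetical helper: fit a forest with the given seed, return test RMSE."""
    model = RandomForestRegressor(
        n_estimators=100, min_samples_split=10, random_state=seed
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, predictions)))

# Same seed twice: the two fits build identical forests
rmse_seeded_a = fit_and_score(100)
rmse_seeded_b = fit_and_score(100)

# No seed: each fit draws fresh random numbers, so the RMSE usually differs
rmse_unseeded_a = fit_and_score(None)
rmse_unseeded_b = fit_and_score(None)

print("seeded:  ", rmse_seeded_a, rmse_seeded_b)
print("unseeded:", rmse_unseeded_a, rmse_unseeded_b)
```

With the seed set, the two RMSE values should match exactly; without it, they will typically differ slightly from run to run. Note that this sketch fits on the training split only, so the RMSE is measured on data the forest has not seen.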
