How to predict correctly in sklearn RandomForestRegressor?


Problem description

I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv

I'm trying to predict the next values of "LandAverageTemperature".

First, I imported the CSV into pandas as a DataFrame named "df1".

After getting errors on my first tries with sklearn, I converted the "dt" column from string to datetime64 and then added a column named "year" that holds only the year of each date (this is probably wrong):

df1["year"] = pd.DatetimeIndex(df1['dt']).year

After all of that, I prepared my data for regression and called RandomForestRegressor:

landAvg = df1[["LandAverageTemperature"]]
year = df1[["year"]]

from sklearn.ensemble import RandomForestRegressor

rf_reg=RandomForestRegressor(n_estimators=10,random_state=0)
rf_reg.fit(year,landAvg.values.ravel())
print("Random forest:",rf_reg.predict(landAvg))

I ran the code and I've seen this result:

Random forest: [9.26558115 9.26558115 9.26558115 ... 9.26558115 9.26558115 9.26558115]

I'm not getting any errors, but I don't think the results are correct (they are all the same, as you can see). Besides, I don't know how to get predictions for the next 10 years; with this code I get only one distinct result. Can you help me improve my code and get the right results? Thanks in advance for your help.

Solution
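
Before changing the features, it may help to see why the original call prints one repeated value. The model was fitted on year, but predict was given landAvg, so the temperatures (roughly 2 to 15) were read as years. Every one of them is smaller than the earliest training year, so each tree sends every row to the same leaf. The sketch below reproduces this on synthetic data. The temperature values and the ten-year range are made up for illustration, and plain NumPy arrays are used to sidestep the column-name checks that, as far as I know, recent scikit-learn versions apply to DataFrames.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real dataset (values are invented for illustration).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1750, 1760), 12)   # 10 years x 12 months
months = np.tile(np.arange(1, 13), 10)
temps = 8.5 + 6.0 * np.sin((months - 4) * np.pi / 6) + rng.normal(0, 0.3, months.size)

year = years.reshape(-1, 1)        # feature matrix: one column, the year
land_avg = temps.reshape(-1, 1)    # the target, shaped like a feature matrix

rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(year, temps)

# As in the question: the target values are passed where years are expected.
wrong = rf_reg.predict(land_avg)
# What was presumably intended: predict from the same feature used in fit().
right = rf_reg.predict(year)

print("distinct values when predicting from temperatures:", np.unique(wrong).size)
print("distinct values when predicting from years:", np.unique(right).size)
```

Even the corrected call can only return one value per year, because year is the only feature, which is why the month is added below.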

It's not enough to use only the year to predict temperature. You need to use the month too. Here is a working example for starters:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Load only the two columns needed and parse 'dt' as datetime
df = pd.read_csv('https://raw.githubusercontent.com/gindeleo/climate/master/GlobalTemperatures.csv', usecols=['dt','LandAverageTemperature'], parse_dates=['dt'])
# Drop rows with a missing temperature
df = df.dropna()
# Derive the year and month features from the date
df["year"] = df['dt'].dt.year
df["month"] = df['dt'].dt.month
X = df[["month", "year"]]
y = df["LandAverageTemperature"]
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(X, y)
# Predict on the training features (same columns, same order as in fit)
y_pred = rf_reg.predict(X)
df_result = pd.DataFrame({'year': X['year'], 'month': X['month'], 'true': y, 'pred': y_pred})
print('True values and predictions')
print(df_result)
print('Feature importances', list(zip(X.columns, rf_reg.feature_importances_)))

And here is the output:

True values and predictions
      year  month    true     pred
0     1750      1   3.034   2.2944
1     1750      2   3.083   2.4222
2     1750      3   5.626   5.6434
3     1750      4   8.490   8.3419
4     1750      5  11.573  11.7569
...    ...    ...     ...      ...
3187  2015      8  14.755  14.8004
3188  2015      9  12.999  13.0392
3189  2015     10  10.801  10.7068
3190  2015     11   7.433   7.1173
3191  2015     12   5.518   5.1634

[3180 rows x 4 columns]
Feature importances [('month', 0.9543059863177156), ('year', 0.045694013682284394)]
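
The question also asks about the next 10 years. One possible approach, sketched here under assumptions, is to build a frame of future (month, year) rows with the same columns as X and pass it to predict. To stay self-contained, the sketch uses a synthetic stand-in for the dataset (the 1990 to 2015 range and the temperature values are invented); with the real data, the fitted rf_reg and df from the example above would take their place. The names future_dates, X_future and df_future are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with the same columns as the example above (invented values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"dt": pd.date_range("1990-01-01", "2015-12-01", freq="MS")})
df["year"] = df["dt"].dt.year
df["month"] = df["dt"].dt.month
df["LandAverageTemperature"] = (
    8.5 + 6.0 * np.sin((df["month"] - 4) * np.pi / 6) + rng.normal(0, 0.3, len(df))
)

X = df[["month", "year"]]
y = df["LandAverageTemperature"]
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(X, y)

# 120 month starts following the last observed date, i.e. the next 10 years.
future_dates = pd.date_range(
    df["dt"].max() + pd.DateOffset(months=1), periods=120, freq="MS"
)
# Same column names and order as the X used in fit().
X_future = pd.DataFrame({"month": future_dates.month, "year": future_dates.year})
df_future = X_future.assign(pred=rf_reg.predict(X_future))

print(df_future.head(12))
# Spread of the predictions for each calendar month across the 10 future years.
print(df_future.groupby("month")["pred"].agg(lambda s: s.max() - s.min()))
```

One caveat: tree-based models cannot extrapolate beyond the range of the training features. Every future year lands on the same side of every split on year, so each calendar month gets the same prediction for all ten future years, and any long-term trend is not carried forward. If the trend matters, a model that can extrapolate (for example a linear or dedicated time-series model) is likely a better fit.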
