How to predict correctly in sklearn RandomForestRegressor?
Problem description
I'm working on a big data project for school. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv
I'm trying to predict the next values of "LandAverageTemperature".
First, I imported the CSV into pandas as a DataFrame named "df1".
After getting errors on my first attempts with sklearn, I converted the "dt" column from string to datetime64, then added a column named "year" that holds only the year from each date. (This is probably wrong.)
df1["year"] = pd.DatetimeIndex(df1['dt']).year
After all of that, I prepared my data for regression and called RandomForestRegressor:
landAvg = df1[["LandAverageTemperature"]]
year = df1[["year"]]
from sklearn.ensemble import RandomForestRegressor
rf_reg=RandomForestRegressor(n_estimators=10,random_state=0)
rf_reg.fit(year,landAvg.values.ravel())
print("Random forest:",rf_reg.predict(landAvg))
I ran the code and got this result:
Random forest: [9.26558115 9.26558115 9.26558115 ... 9.26558115 9.26558115 9.26558115]
I'm not getting any errors, but I don't think the results are correct (they are all the same, as you can see). Besides, I don't know how to get predictions for the next 10 years; this code gives me just one result. Can you help me improve my code and get the right results? Thanks in advance for your help.
Answer
Using only the year is not enough to predict temperature. You need to use the month as well. Here is a working example for starters:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Load only the date and target columns, parsing 'dt' as datetime.
df = pd.read_csv(
    'https://raw.githubusercontent.com/gindeleo/climate/master/GlobalTemperatures.csv',
    usecols=['dt', 'LandAverageTemperature'],
    parse_dates=['dt'],
)
# Drop rows with a missing temperature.
df = df.dropna()
# Derive the two features from the date.
df["year"] = df['dt'].dt.year
df["month"] = df['dt'].dt.month
X = df[["month", "year"]]
y = df["LandAverageTemperature"]
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(X, y)
# Predict on the same rows the model was trained on (in-sample predictions).
y_pred = rf_reg.predict(X)
df_result = pd.DataFrame({'year': X['year'], 'month': X['month'], 'true': y, 'pred': y_pred})
print('True values and predictions')
print(df_result)
print('Feature importances', list(zip(X.columns, rf_reg.feature_importances_)))
And here is the output:
True values and predictions
      year  month    true     pred
0     1750      1   3.034   2.2944
1     1750      2   3.083   2.4222
2     1750      3   5.626   5.6434
3     1750      4   8.490   8.3419
4     1750      5  11.573  11.7569
...    ...    ...     ...      ...
3187  2015      8  14.755  14.8004
3188  2015      9  12.999  13.0392
3189  2015     10  10.801  10.7068
3190  2015     11   7.433   7.1173
3191  2015     12   5.518   5.1634
[3180 rows x 4 columns]
Feature importances [('month', 0.9543059863177156), ('year', 0.045694013682284394)]
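A likely reason the question's code printed one repeated value: the model was fitted on year, but predict was given landAvg (the temperatures). Every temperature is far below the smallest training year, so each tree sends every row down the same branch to the same leaf. The sketch below reproduces this on synthetic data (an assumption standing in for the real CSV, so that it runs offline). NumPy arrays are used because recent scikit-learn versions may warn or raise when the column name at predict time differs from the one seen in fit.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real dataset (an assumption, not the actual data):
# monthly temperatures with a seasonal cycle, 1750-2015.
dates = pd.date_range("1750-01-01", "2015-12-01", freq="MS")
rng = np.random.default_rng(0)
month = dates.month.to_numpy()
temps = 8.5 + 6.0 * np.sin(2 * np.pi * (month - 4) / 12) + rng.normal(0, 0.5, len(dates))
df1 = pd.DataFrame({"dt": dates, "LandAverageTemperature": temps})
df1["year"] = df1["dt"].dt.year

# Same setup as the question: year is the only feature.
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(df1[["year"]].to_numpy(), df1["LandAverageTemperature"].to_numpy())

# The question's call: temperatures passed where years are expected.
wrong = rf_reg.predict(df1[["LandAverageTemperature"]].to_numpy())
# The intended call: the same feature the model was trained on.
right = rf_reg.predict(df1[["year"]].to_numpy())

print("distinct predictions (temperatures as input):", np.unique(wrong).size)
print("distinct predictions (years as input):", np.unique(right).size)
```

With temperatures as input, the count of distinct predictions collapses to one; with years as input, it does not.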
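The question also asks how to get predictions for the next 10 years. One way, sketched below, is to build a frame with one row per future month, using the same columns in the same order as the training features, and pass it to predict. The data here is synthetic (an assumption so the block runs offline), and the names future_dates, X_future and forecast are illustrative. Note a limitation of tree-based models: they cannot extrapolate, so every year beyond the training range falls into the same leaves as the last training year and the forecast repeats the same 12 monthly values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real dataset (an assumption, not the actual data).
dates = pd.date_range("1750-01-01", "2015-12-01", freq="MS")
rng = np.random.default_rng(0)
month = dates.month.to_numpy()
temps = 8.5 + 6.0 * np.sin(2 * np.pi * (month - 4) / 12) + rng.normal(0, 0.5, len(dates))
df = pd.DataFrame({
    "month": month,
    "year": dates.year.to_numpy(),
    "LandAverageTemperature": temps,
})

X = df[["month", "year"]]
y = df["LandAverageTemperature"]
rf_reg = RandomForestRegressor(n_estimators=10, random_state=0)
rf_reg.fit(X, y)

# One row per future month (10 years = 120 months), same columns and order as X.
future_dates = pd.date_range("2016-01-01", periods=120, freq="MS")
X_future = pd.DataFrame({
    "month": future_dates.month.to_numpy(),
    "year": future_dates.year.to_numpy(),
})
forecast = pd.DataFrame({"dt": future_dates, "pred": rf_reg.predict(X_future)})
print(forecast.head(12))

# Years past the training range all land in the same leaves, so the first
# forecast year and the last forecast year get identical monthly values.
first_year = forecast["pred"].iloc[:12].to_numpy()
last_year = forecast["pred"].iloc[-12:].to_numpy()
print("forecast repeats each year:", np.allclose(first_year, last_year))
```

If a long-term trend matters, a model that can extrapolate (for example a linear trend plus seasonal terms) may be a better fit than a random forest on raw year values.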