Pandas Correlation Error Using Sklearn Metrics


Problem description

I am trying to calculate r2 (R-squared) over a large dataset with pandas, grouping the data by plant_name and month in a DataFrame like "data1", shown below. The problem is that when I use sklearn metrics inside a function I defined, the result is not consistent with the result I obtain from the same "data1" data in Excel. Here is the data in "data1":

    plant_name  month  year  wind_speed_obs  wind_speed_ms
0   BIG HORN I      1  2018        5.143830       6.012436
1   BIG HORN I      1  2019        4.556545       5.231855
2   BIG HORN I      1  2020        6.582890       7.866532
3   BIG HORN I      2  2018        7.904438       9.248810
4   BIG HORN I      2  2019        4.353567       5.115625
5   BIG HORN I      2  2020        7.376739       8.408046
6   BIG HORN I      3  2018        6.138197       6.922043
7   BIG HORN I      3  2019        3.881804       4.484274
8   BIG HORN I      3  2020        7.071029       7.347177
9   BIG HORN I      4  2018        7.106936       7.699861
10  BIG HORN I      4  2019        6.874942       7.575278
11  BIG HORN I      4  2020        6.855979       7.106250
12  BIG HORN I      5  2018        5.366054       6.510753
13  BIG HORN I      5  2019        5.657342       6.597581
14  BIG HORN I      5  2020        7.010745       7.247043
15  BIG HORN I      6  2018        6.399417       7.076528
16  BIG HORN I      6  2019        6.578241       7.556111
17  BIG HORN I      6  2020        7.120105       7.548194
18  BIG HORN I      7  2018        5.615110       6.123925
19  BIG HORN I      7  2019        6.212104       6.963441
20  BIG HORN I      7  2020        6.663250       6.972312
21  BIG HORN I      8  2018        5.303967       5.947312
22  BIG HORN I      8  2019        5.176691       6.209274
23  BIG HORN I      8  2020        6.093748       6.337634
24  BIG HORN I      9  2018        5.375531       5.878472
25  BIG HORN I      9  2019        6.126961       6.792500
26  BIG HORN I      9  2020        5.608530       6.028056
27  BIG HORN I     10  2018        4.466079       5.054973
28  BIG HORN I     10  2019        5.492795       6.326075
29  BIG HORN I     10  2020        7.103278       7.492070
30  BIG HORN I     11  2018        5.341987       5.889028
31  BIG HORN I     11  2019        4.887397       5.144028
32  BIG HORN I     11  2020        6.718649       7.150000
33  BIG HORN I     12  2018        5.099386       5.866935
34  BIG HORN I     12  2019        3.925717       4.234140
35  BIG HORN I     12  2020        5.589325       5.943145

Here is the code I'm using:

import pandas as pd
from sklearn.metrics import r2_score

def r2_rmse2(g):
    # r2_score(y_true, y_pred): observations first, modelled values second
    r2 = r2_score(g['wind_speed_obs'], g['wind_speed_ms'])
    #rmse = np.sqrt( mean_squared_error( g['wind_speed_obs'], g['wind_speed_ms'] ) )
    return pd.Series(dict(r2=r2))

data1.groupby(['plant_name', 'month']).apply(r2_rmse2).reset_index()

I obtain this result when applying the r2_rmse2 function above:

    plant_name  month         r2
0   BIG HORN I      1  -0.314771
1   BIG HORN I      2   0.529890
2   BIG HORN I      3   0.804066
3   BIG HORN I      4 -22.164720
4   BIG HORN I      5  -0.460690
5   BIG HORN I      6  -4.673359
6   BIG HORN I      7  -0.662166
7   BIG HORN I      8  -2.118815
8   BIG HORN I      9  -1.946566
9   BIG HORN I     10   0.662636
10  BIG HORN I     11   0.696896
11  BIG HORN I     12   0.446235

The result I expect, which is what I get when I run the same calculation in Excel, is:

plant_name  month   r2
BIG HORN I  1   0.999975202
BIG HORN I  2   0.998459857
BIG HORN I  3   0.988712352
BIG HORN I  4   0.711649414
BIG HORN I  5   0.998282523
BIG HORN I  6   0.681460011
BIG HORN I  7   0.907152074
BIG HORN I  8   0.66212225
BIG HORN I  9   0.98807953
BIG HORN I  10  0.988469127
BIG HORN I  11  0.990836283
BIG HORN I  12  0.968629237

I cannot understand why applying the function gives a different result. Thank you for your help.

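Solution

The values you get in Excel are Pearson correlation coefficients, whereas r2_score computes the coefficient of determination, which is a different quantity and can be negative. Here is the computation of R squared, RMSE and the Pearson correlation coefficient (as used in Excel) on your data: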

import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import r2_score, mean_squared_error

def r2_rmse2(g):
    obs, ms = g['wind_speed_obs'], g['wind_speed_ms']
    r2 = r2_score(obs, ms)                       # coefficient of determination
    rmse = np.sqrt(mean_squared_error(obs, ms))  # squared=False is deprecated in recent scikit-learn releases
    correl = pearsonr(obs, ms)[0]                # Pearson correlation coefficient
    return pd.Series(dict(r2=r2, rmse=rmse, correl=correl))

data1.groupby(['plant_name', 'month']).apply(r2_rmse2).reset_index()

    plant_name  month         r2      rmse    correl
0   BIG HORN I      1  -0.314771  0.976090  0.999975
1   BIG HORN I      2   0.529890  1.072639  0.998460
2   BIG HORN I      3   0.804066  0.592633  0.988712
3   BIG HORN I      4 -22.164844  0.549141  0.711649
4   BIG HORN I      5  -0.460691  0.866068  0.998283
5   BIG HORN I      6  -4.673359  0.729833  0.681460
6   BIG HORN I      7  -0.662167  0.553450  0.907152
7   BIG HORN I      8  -2.118817  0.716380  0.662122
8   BIG HORN I      9  -1.946562  0.539102  0.988080
9   BIG HORN I     10   0.662637  0.630426  0.988469
10  BIG HORN I     11   0.696896  0.428632  0.990836
11  BIG HORN I     12   0.446234  0.519437  0.968629
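
To see why the r2 and correl columns disagree so strongly, it can help to compute both quantities by hand for a single group. The sketch below uses only the Python standard library and the three month-1 rows from "data1". The helper names (mean, coefficient_of_determination, pearson_r) are made up for this illustration and are not part of sklearn or scipy. It assumes the usual definitions: the coefficient of determination is 1 - SS_res / SS_tot, with the observations as the reference, and Pearson's r is the covariance divided by the product of the standard deviations.

```python
import math

# Month 1 rows for BIG HORN I, copied from data1 above.
obs = [5.143830, 4.556545, 6.582890]  # wind_speed_obs, used as y_true
ms = [6.012436, 5.231855, 7.866532]   # wind_speed_ms, used as y_pred


def mean(values):
    return sum(values) / len(values)


def coefficient_of_determination(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when y_pred fits worse than mean(y_true)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Linear correlation; unaffected by a constant offset or rescaling of x or y."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - x_bar) ** 2 for a in x))
    sy = math.sqrt(sum((b - y_bar) ** 2 for b in y))
    return cov / (sx * sy)


r2 = coefficient_of_determination(obs, ms)
r = pearson_r(obs, ms)
print(f"r2 = {r2:.4f}, pearson r = {r:.4f}")
```

For this group the two values should come out close to the first row of the table above (about -0.31 and about 1.00). The modelled wind speeds are consistently higher than the observed ones, so the squared residuals exceed the spread of the observations around their mean and r2 goes negative, while the correlation stays high because the two series still move together almost linearly.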
