Scikit-Learn giving incorrect R Squared value

Problem description

I'm training machine learning models in Python and using the R squared metric from scikit-learn to evaluate them. I decided to play around with scikit-learn's r2_score function, feeding it an array of one repeated value as y_true and an array of slightly different, nearly constant values as y_pred. I get arbitrarily large negative values when the input arrays have 10 or more elements, and 0 when they have fewer than 10.

>>> from sklearn.metrics import r2_score
>>> r2_score([213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667,
...           213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667],
...          [213, 214, 214, 214, 214, 214, 214, 214, 214, 214])
-1.1175847590636849e+26
>>> r2_score([213.91666667, 213.91666667, 213.91666667, 213.91666667,
...           213.91666667, 213.91666667, 213.91666667, 213.91666667, 213.91666667],
...          [213, 214, 214, 214, 214, 214, 214, 214, 214])
0

Solution

You're correct in noting that the r2_score output is not correct. However, this is a result of a simpler computation issue rather than a problem with the scikit-learn package.

Try running

>>> input_list = [213.91666667,  213.91666667,  213.91666667,  213.91666667,  213.91666667, 
  213.91666667, 213.91666667,  213.91666667,  213.91666667,  213.91666667]
>>> sum(input_list)/len(input_list)

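As you can see, the output is not exactly 213.91666667. This is a floating-point error: a float has limited precision, so the sum of the list is rounded and the computed mean can differ from the true mean in its last digits. Why does this matter?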

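Well, the scikit-learn User Guide gives the specific formula used to calculate r2_score. It is simply 1 - (residual sum of squares)/(total sum of squares), where the residual sum of squares adds up the squared differences between y_true and y_pred, and the total sum of squares adds up the squared differences between y_true and its mean.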
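The formula can be written out by hand to make the two sums visible. This is a minimal sketch in plain Python, not scikit-learn's implementation; `r2_by_hand` is a hypothetical helper name, and it deliberately has no special handling for a zero denominator.

```python
def r2_by_hand(y_true, y_pred):
    """Return (rss, tss, r2) computed directly from the definition of R squared."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of y_true from its own mean.
    tss = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: tss is nonzero; a tss of exactly 0 raises ZeroDivisionError here.
    return rss, tss, 1 - rss / tss


rss, tss, r2 = r2_by_hand([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(rss, tss, r2)  # 1.0 2.0 0.5
```

With well-spread targets like these, the denominator is far from zero and the score behaves as expected.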

In the first case you specify, the residual sum of squares is an unremarkable number. You can calculate it easily: (213.91666667 - 213)^2 + 9 * (213.91666667 - 214)^2, which is about 0.90. However, due to the floating-point error described above, the total sum of squares isn't exactly 0, but rather some very, very small number (on the order of 10^-26).

Thus, when you divide the residual sum of squares (around 0.90) by the total sum of squares (a very small number), you're left with a very large number. Since that large number is subtracted from 1, you are left with a negative number of high magnitude as your r2_score output.
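The arithmetic can be checked with a short sketch. The residual sum of squares is computed directly from the values in the question. The total sum of squares is an assumption: it supposes the computed mean is off from 213.91666667 by exactly one unit in the last place (2**-45 for doubles between 128 and 256), which is inferred from the reported output rather than taken from scikit-learn's code.

```python
y_true = [213.91666667] * 10
y_pred = [213] + [214] * 9

# Residual sum of squares, computed directly from the question's values.
rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Assumption: the computed mean differs from the true value by one ULP.
# For doubles in [128, 256) one ULP is 2**-45.
one_ulp = 2.0 ** -45
tss_assumed = len(y_true) * one_ulp ** 2

r2 = 1 - rss / tss_assumed
print(f"rss = {rss:.4f}")
print(f"tss = {tss_assumed:.3e}")
print(f"r2  = {r2:.3e}")
```

Under that assumption the result has the same order of magnitude as the value reported in the question, which supports the explanation that a rounding error of a single last-place unit in the mean is enough to produce it.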

This imprecision in the calculation of the total sum of squares does not occur in the second case, so the denominator is exactly 0. The division is then undefined, and the function returns 0 instead.
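If you want a score that does not depend on how the mean happens to round, one option is to detect a constant target before dividing. The sketch below is a hypothetical helper (`r2_guarded` is not part of scikit-learn). Returning 0.0 for a constant target mirrors the value reported in the question for a zero denominator, and returning 1.0 when the predictions are also perfect is an assumption.

```python
def r2_guarded(y_true, y_pred):
    """R squared with an explicit check for a constant target (hypothetical helper)."""
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    if max(y_true) == min(y_true):
        # Every target is identical, so the total sum of squares is
        # mathematically zero, whatever the rounded mean turns out to be.
        return 1.0 if rss == 0 else 0.0
    mean_true = sum(y_true) / len(y_true)
    tss = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - rss / tss


ten = r2_guarded([213.91666667] * 10, [213] + [214] * 9)
nine = r2_guarded([213.91666667] * 9, [213] + [214] * 8)
print(ten, nine)  # 0.0 0.0
```

Comparing the maximum and minimum is exact, so both input lengths from the question give the same answer. It only catches targets that are exactly constant; nearly constant targets would need a tolerance chosen for the data at hand.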
