How does the predict function of StatsModels interact with roc_auc_score of scikit-learn?
Question
I am trying to understand the predict
function in Python statsmodels for a Logit model. Its documentation is here.
When I build a Logit model and use predict
, it returns values from 0 to 1, as opposed to 0 or 1. I have read that these are probabilities and that we need a threshold: Python statsmodel.api logistic regression (Logit)
Now I want to produce AUC numbers, and I use roc_auc_score
from sklearn (docs).
Here is where I start getting confused.
- When I pass the raw predicted values (probabilities) from my Logit model into roc_auc_score as the second argument, y_score, I get a reasonable AUC value of around 80%. How does the roc_auc_score function know which of my probabilities equate to 1 and which equate to 0? Nowhere was I given an opportunity to set a threshold.
- When I manually convert my probabilities into 0 or 1 using a threshold of 0.5, I get an AUC of around 50%. Why would this happen?
Here is some code:
m1_result = m1.fit(disp=False)
roc_auc_score(y, m1_result.predict(X1))
# AUC: 0.80
roc_auc_score(y, [1 if p >= 0.5 else 0 for p in m1_result.predict(X1)])
# AUC: 0.50
Why is this happening?
Answer
Your second way of calculating the AUC is wrong: by definition, AUC needs probabilities, not the hard 0/1 class predictions generated after thresholding, as you do here. So, your AUC is 0.80.
You don't set a threshold yourself in the AUC calculation; roughly speaking, as I have explained elsewhere, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
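The point above can be demonstrated on synthetic data (made up for illustration, not the asker's model): roc_auc_score uses the full ranking of the raw probabilities, so binarizing them first throws that ranking away. In this sketch every probability happens to fall below 0.5, so the thresholded predictions are all identical and the AUC degenerates to exactly 0.5 — one plausible way to end up with the 0.50 seen in the question.

```python
# Illustrative only: AUC from probabilities vs. AUC from thresholded 0/1 labels.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, size=1000)            # imbalanced binary labels
# Informative scores that separate the classes but never reach 0.5
scores = np.clip(0.15 + 0.2 * y + rng.normal(0, 0.05, size=1000), 0.001, 0.499)

auc_probs = roc_auc_score(y, scores)                       # full ranking used
auc_hard = roc_auc_score(y, (scores >= 0.5).astype(int))   # all zeros -> 0.5

print(f"AUC from probabilities: {auc_probs:.2f}")
print(f"AUC from hard 0/1:      {auc_hard:.2f}")
```

Whether or not the asker's probabilities all sat below 0.5, the underlying lesson is the same: pass the continuous scores to roc_auc_score and let it sweep the thresholds itself.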
It would be overkill to explain again here the rationale and details of the AUC calculation; instead, these other SE threads (and the links therein) will help you get the idea: