scikit-learn LogisticRegressionCV: best coefficients


Question

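I am trying to understand how the best coefficients are computed in LogisticRegressionCV when the "refit" parameter is True. If I understand the docs correctly, the best coefficients result from first determining the best regularization parameter "C", i.e., the value of C with the highest average score over all folds. The best coefficients are then simply the coefficients computed on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of those folds are averaged to give the best coefficients (I did not see anything in the docs about how this case is handled).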
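The C-selection step described above can be sketched with plain NumPy. The score matrix below is made up purely for illustration (it is not output from scikit-learn); only the averaging and argmax logic is the point.

```python
import numpy as np

# Candidate C values, as in the question.
C_values = [0.001, 0.01, 0.05, 0.1, 1.0, 100.0]

# Hypothetical cross-validation scores, shape (n_folds, n_Cs).
# These numbers are invented for this sketch.
scores = np.array([
    [0.90, 0.95, 0.97, 0.98, 0.99, 0.96],
    [0.88, 0.94, 0.99, 0.97, 0.98, 0.95],
    [0.91, 0.93, 0.96, 0.99, 1.00, 0.97],
])

# Average over folds (axis 0), then pick the C with the highest mean score.
# np.argmax returns the first index in case of ties.
mean_scores = scores.mean(axis=0)
best_C_idx = int(np.argmax(mean_scores))
best_C = C_values[best_C_idx]

print("mean score per C:", np.round(mean_scores, 4))
print("best C:", best_C)
```

Note that this only covers how C is chosen; how the coefficients are then obtained depends on the "refit" setting.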

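To test my understanding, I determined the best coefficients in two different ways:

  1. directly from the coef_ attribute of the fitted model, and
  2. from the coefs_paths_ attribute, which contains the path of the coefficients obtained during cross-validation, across each fold and then across each C.

The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here. Thanks!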

Example demonstrating the issue:

from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]

# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1', 
                           refit=True, scoring='roc_auc', 
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)

########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")

########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the 
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]

paths = clf.coefs_paths_[1]  # has shape (n_folds, len(C_values), n_features)
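# Keep the fold axis (no squeeze) so the mean over folds also works
# when only a single fold attains the maximum score
coefs2 = paths[best_folds_idx, best_C_idx, :]  # shape (n_best_folds, n_features)
coefs2 = np.mean(coefs2, axis=0)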
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")

Recommended answer

I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94

The key is the refit parameter of LogisticRegressionCV. According to the scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html):

refit : bool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
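
In other words, with refit=True the final coef_ comes from a model fitted again with the selected C, not from any single fold stored in coefs_paths_. One way to check this is to fit a plain LogisticRegression on the full training set with the selected C and compare coefficients. The following is a minimal sketch that assumes the same data and settings as in the question; the names shared, cv_clf, and plain_clf are introduced here only for illustration. With the liblinear solver the two sets of coefficients should agree up to solver tolerance, though that may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data preparation as in the question.
X, y = load_breast_cancer(return_X_y=True)
X_train, _, y_train, _ = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Settings shared by both estimators (taken from the question).
shared = dict(penalty="l1", solver="liblinear",
              fit_intercept=False, random_state=0)
C_values = [0.001, 0.01, 0.05, 0.1, 1.0, 100.0]

# Cross-validated model with refit=True.
cv_clf = LogisticRegressionCV(Cs=C_values, cv=10, refit=True,
                              scoring="roc_auc", **shared)
cv_clf.fit(X_train_scaled, y_train)

# Selected C; np.ravel handles both array-valued and scalar C_.
best_C = float(np.ravel(cv_clf.C_)[0])

# Plain model fitted on the full training set with that C.
plain_clf = LogisticRegression(C=best_C, **shared)
plain_clf.fit(X_train_scaled, y_train)

# Largest absolute difference between the two coefficient vectors.
max_abs_diff = float(np.max(np.abs(cv_clf.coef_ - plain_clf.coef_)))
print("selected C:", best_C)
print("max |coef_cv - coef_plain|:", max_abs_diff)
```

If the difference is negligible, that would explain why method 2 in the question (per-fold coefficients from coefs_paths_) is close to but not identical with coef_: each fold's coefficients are fitted on only part of the training data.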

Best regards.
