sklearn GridSearchCV not using sample_weight in score function


Problem Description


I have data with differing weights for each sample. In my application, it is important that these weights are accounted for in estimating the model and comparing alternative models.

I'm using sklearn to estimate models and to compare alternative hyperparameter choices. But this unit test shows that GridSearchCV does not apply sample_weights to estimate scores.

Is there a way to have sklearn use sample_weight to score the models?

Unit test:

from __future__ import division

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedKFold


def grid_cv(X_in, y_in, w_in, cv, max_features_grid, use_weighting):
  out_results = dict()

  for k in max_features_grid:
    clf = RandomForestClassifier(n_estimators=256,
                                 criterion="entropy",
                                 warm_start=False,
                                 n_jobs=-1,
                                 random_state=RANDOM_STATE,
                                 max_features=k)
    for train_ndx, test_ndx in cv.split(X=X_in, y=y_in):
      X_train = X_in[train_ndx, :]
      y_train = y_in[train_ndx]
      w_train = w_in[train_ndx]
      y_test = y_in[test_ndx]

      clf.fit(X=X_train, y=y_train, sample_weight=w_train)

      y_hat = clf.predict_proba(X=X_in[test_ndx, :])
      if use_weighting:
        w_test = w_in[test_ndx]
        w_i_sum = w_test.sum()
        score = w_i_sum / w_in.sum() * log_loss(y_true=y_test, y_pred=y_hat, sample_weight=w_test)
      else:
        score = log_loss(y_true=y_test, y_pred=y_hat)

      results = out_results.get(k, [])
      results.append(score)
      out_results.update({k: results})

  for k, v in out_results.items():
    if use_weighting:
      mean_score = sum(v)
    else:
      mean_score = np.mean(v)
    out_results.update({k: mean_score})

  best_score = min(out_results.values())
  best_param = min(out_results, key=out_results.get)
  return best_score, best_param


if __name__ == "__main__":
  RANDOM_STATE = 1337
  X, y = load_iris(return_X_y=True)
  sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
  # sample_weight = np.array([1 for _ in range(len(X))])

  inner_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

  outer_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

  rfc = RandomForestClassifier(n_estimators=256,
                               criterion="entropy",
                               warm_start=False,
                               n_jobs=-1,
                               random_state=RANDOM_STATE)
  search_params = {"max_features": [1, 2, 3, 4]}


  fit_params = {"sample_weight": sample_weight}
  my_scorer = make_scorer(log_loss, 
               greater_is_better=False, 
               needs_proba=True, 
               needs_threshold=False)

  grid_clf = GridSearchCV(estimator=rfc,
                          scoring=my_scorer,
                          cv=inner_cv,
                          param_grid=search_params,
                          refit=True,
                          return_train_score=False,
                          iid=False)  # in this usage, the results are the same for `iid=True` and `iid=False`
  grid_clf.fit(X, y, **fit_params)
  print("This is the best out-of-sample score using GridSearchCV: %.6f." % -grid_clf.best_score_)

  msg = """This is the best out-of-sample score %s weighting using grid_cv: %.6f."""
  score_with_weights, param_with_weights = grid_cv(X_in=X,
                                                   y_in=y,
                                                   w_in=sample_weight,
                                                   cv=inner_cv,
                                                   max_features_grid=search_params.get(
                                                     "max_features"),
                                                   use_weighting=True)
  print(msg % ("WITH", score_with_weights))

  score_without_weights, param_without_weights = grid_cv(X_in=X,
                                                         y_in=y,
                                                         w_in=sample_weight,
                                                         cv=inner_cv,
                                                         max_features_grid=search_params.get(
                                                           "max_features"),
                                                         use_weighting=False)
  print(msg % ("WITHOUT", score_without_weights))

Which produces output:

This is the best out-of-sample score using GridSearchCV: 0.135692.
This is the best out-of-sample score WITH weighting using grid_cv: 0.099367.
This is the best out-of-sample score WITHOUT weighting using grid_cv: 0.135692.

Explanation: Since manually computing the loss without weighting produces the same scoring as GridSearchCV, we know that the sample weights are not being used.

Solution

GridSearchCV takes a scoring argument, which can be a callable. The scikit-learn documentation describes how to change the scoring function and how to pass your own.
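
For the sake of completeness, here is a minimal sketch of the two usual ways to supply a custom score (illustrative only, not the exact snippet from the documentation):

from sklearn.metrics import log_loss, make_scorer

# Option 1: wrap a metric with make_scorer; greater_is_better=False tells
# GridSearchCV to negate the loss so "best" still means lowest loss.
loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

# Option 2: pass any callable with the signature scorer(estimator, X, y) -> float,
# where a larger return value means a better model.
def neg_log_loss_scorer(estimator, X, y):
    proba = estimator.predict_proba(X)
    return -log_loss(y, proba)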

EDIT: fit_params is passed only to the fit method, not to the scoring function. If there are parameters that are supposed to reach the scorer, they should be passed to make_scorer. But that still doesn't solve the issue here, since it would mean the whole sample_weight vector gets passed to log_loss, whereas only the part corresponding to y_test at the time of calculating the loss should be passed.
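
To make that failure mode concrete, here is a minimal sketch of the naive approach (it reuses the weight vector from the question; the exact error depends on the sklearn version, but log_loss rejects a sample_weight whose length does not match y_test):

import numpy as np
from sklearn.metrics import log_loss, make_scorer

# Full-length weight vector, as defined in the question (150 iris samples).
sample_weight = np.array([1 + 100 * (i % 25) for i in range(150)])

# Naive attempt: forward the whole vector through make_scorer as a kwarg.
naive_scorer = make_scorer(log_loss,
                           greater_is_better=False,
                           needs_proba=True,
                           sample_weight=sample_weight)

# During cross-validation the scorer only ever sees a fold's test rows
# (roughly 50 of the 150 samples with 3-fold CV), so log_loss would receive
# 150 weights for ~50 labels and raise an error; it has no way to pick out
# the subset of weights that belongs to y_test.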

sklearn does NOT support such a thing out of the box, but you can hack your way through using a pandas.DataFrame. The good news is that sklearn understands a DataFrame and keeps it that way, which means you can exploit the index of a DataFrame, as you see in the code here:

  # more code

  X, y = load_iris(return_X_y=True)
  index = ['r%d' % x for x in range(len(y))]
  y_frame = pd.DataFrame(y, index=index)
  sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
  sample_weight_frame = pd.DataFrame(sample_weight, index=index)

  # more code

  def score_f(y_true, y_pred, sample_weight):
      return log_loss(y_true.values, y_pred,
                      sample_weight=sample_weight.loc[y_true.index.values].values.reshape(-1),
                      normalize=True)

  score_params = {"sample_weight": sample_weight_frame}
  my_scorer = make_scorer(score_f,
                          greater_is_better=False, 
                          needs_proba=True, 
                          needs_threshold=False,
                          **score_params)

  grid_clf = GridSearchCV(estimator=rfc,
                          scoring=my_scorer,
                          cv=inner_cv,
                          param_grid=search_params,
                          refit=True,
                          return_train_score=False,
                          iid=False)  # in this usage, the results are the same for `iid=True` and `iid=False`
  grid_clf.fit(X, y_frame)

  # more code

As you see, the score_f uses the index of y_true to find which parts of sample_weight to use. For the sake of completeness, here's the whole code:

from __future__ import division

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.metrics import make_scorer
import pandas as pd

def grid_cv(X_in, y_in, w_in, cv, max_features_grid, use_weighting):
  out_results = dict()

  for k in max_features_grid:
    clf = RandomForestClassifier(n_estimators=256,
                                 criterion="entropy",
                                 warm_start=False,
                                 n_jobs=1,
                                 random_state=RANDOM_STATE,
                                 max_features=k)
    for train_ndx, test_ndx in cv.split(X=X_in, y=y_in):
      X_train = X_in[train_ndx, :]
      y_train = y_in[train_ndx]
      w_train = w_in[train_ndx]
      y_test = y_in[test_ndx]

      clf.fit(X=X_train, y=y_train, sample_weight=w_train)

      y_hat = clf.predict_proba(X=X_in[test_ndx, :])
      if use_weighting:
        w_test = w_in[test_ndx]
        w_i_sum = w_test.sum()
        score = w_i_sum / w_in.sum() * log_loss(y_true=y_test, y_pred=y_hat, sample_weight=w_test)
      else:
        score = log_loss(y_true=y_test, y_pred=y_hat)

      results = out_results.get(k, [])
      results.append(score)
      out_results.update({k: results})

  for k, v in out_results.items():
    if use_weighting:
      mean_score = sum(v)
    else:
      mean_score = np.mean(v)
    out_results.update({k: mean_score})

  best_score = min(out_results.values())
  best_param = min(out_results, key=out_results.get)
  return best_score, best_param


#if __name__ == "__main__":
if True:
  RANDOM_STATE = 1337
  X, y = load_iris(return_X_y=True)
  index = ['r%d' % x for x in range(len(y))]
  y_frame = pd.DataFrame(y, index=index)
  sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
  sample_weight_frame = pd.DataFrame(sample_weight, index=index)
  # sample_weight = np.array([1 for _ in range(len(X))])

  inner_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

  outer_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

  rfc = RandomForestClassifier(n_estimators=256,
                               criterion="entropy",
                               warm_start=False,
                               n_jobs=1,
                               random_state=RANDOM_STATE)
  search_params = {"max_features": [1, 2, 3, 4]}


  def score_f(y_true, y_pred, sample_weight):
      return log_loss(y_true.values, y_pred,
                      sample_weight=sample_weight.loc[y_true.index.values].values.reshape(-1),
                      normalize=True)

  score_params = {"sample_weight": sample_weight_frame}
  my_scorer = make_scorer(score_f,
                          greater_is_better=False, 
                          needs_proba=True, 
                          needs_threshold=False,
                          **score_params)

  grid_clf = GridSearchCV(estimator=rfc,
                          scoring=my_scorer,
                          cv=inner_cv,
                          param_grid=search_params,
                          refit=True,
                          return_train_score=False,
                          iid=False)  # in this usage, the results are the same for `iid=True` and `iid=False`
  grid_clf.fit(X, y_frame)
  print("This is the best out-of-sample score using GridSearchCV: %.6f." % -grid_clf.best_score_)

  msg = """This is the best out-of-sample score %s weighting using grid_cv: %.6f."""
  score_with_weights, param_with_weights = grid_cv(X_in=X,
                                                   y_in=y,
                                                   w_in=sample_weight,
                                                   cv=inner_cv,
                                                   max_features_grid=search_params.get(
                                                     "max_features"),
                                                   use_weighting=True)
  print(msg % ("WITH", score_with_weights))

  score_without_weights, param_without_weights = grid_cv(X_in=X,
                                                         y_in=y,
                                                         w_in=sample_weight,
                                                         cv=inner_cv,
                                                         max_features_grid=search_params.get(
                                                           "max_features"),
                                                         use_weighting=False)
  print(msg % ("WITHOUT", score_without_weights))

The output of the code is then:

This is the best out-of-sample score using GridSearchCV: 0.095439.
This is the best out-of-sample score WITH weighting using grid_cv: 0.099367.
This is the best out-of-sample score WITHOUT weighting using grid_cv: 0.135692.

EDIT 2: As the comment below says:

the difference in my score and the sklearn score using this solution originates in the way that I was computing a weighted average of scores. If you omit the weighted average portion of the code, the two outputs match to machine precision.
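
In other words, if the per-fold losses in grid_cv are aggregated the way GridSearchCV aggregates them, the discrepancy disappears. Here is a minimal sketch of that aggregation, assuming the fold results were collected as (y_test, y_hat, w_test) tuples inside the question's CV loop; it drops the w_test.sum() / w_in.sum() prefactor and the sum-over-folds in favor of a plain mean:

import numpy as np
from sklearn.metrics import log_loss

def mean_fold_log_loss(fold_results, use_weighting):
    """Average per-fold log_loss the way GridSearchCV does.

    fold_results: iterable of (y_test, y_hat, w_test) tuples collected in the
    CV loop. Each fold's loss uses that fold's weights, but folds are then
    combined with a plain mean (no w_test.sum() / w_in.sum() prefactor).
    """
    scores = []
    for y_test, y_hat, w_test in fold_results:
        if use_weighting:
            scores.append(log_loss(y_test, y_hat, sample_weight=w_test))
        else:
            scores.append(log_loss(y_test, y_hat))
    return np.mean(scores)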
