How to implement a meta-estimator with the scikit-learn API?

Question

I would like to implement a simple wrapper / meta-estimator that is compatible with all of scikit-learn. It is hard to find a full description of exactly what I need.

The goal is to have a regressor that also learns a threshold, so that it can act as a classifier. So I came up with:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class Thresholder(BaseEstimator, ClassifierMixin):
    def __init__(self, regressor):
        self.regressor = regressor
        # threshold_ does not get initialized in __init__ ??

    def fit(self, X, y, optimal_threshold):
        self.regressor = clone(self.regressor)    # is this required by sklearn??
        self.regressor.fit(X, y)

        y_raw = self.regressor.predict(X)
        self.threshold_ = optimal_threshold(y_raw)

    def predict(self, X):
        y_raw = self.regressor.predict(X)

        y = np.digitize(y_raw, [self.threshold_])

        return y

Does this implement the full API I need?

My main question is where to put the threshold. I want it to be learned only once and then reused in subsequent .fit calls with new data, without being readjusted. But with the current version it is retuned on every .fit call, which I do not want.

On the other hand, if I make it a fixed parameter self.threshold and pass it to __init__, then I'm not supposed to change it based on the data, am I?

How can I make a threshold parameter that can be tuned in one call of .fit and stay fixed for subsequent .fit calls?

Recommended answer

I actually wrote a blog post about this the other day. I assume you are trying to build something similar to TransformedTargetRegressor; I would suggest taking a look at its source code to build something similar.
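
For reference, here is a minimal sketch of how TransformedTargetRegressor behaves as a meta-estimator. The toy data and the log/exp transform are assumptions chosen only for illustration. The point it tries to show is that the regressor passed to the constructor is left unfitted, while a fitted clone is stored in regressor_:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): y grows exponentially with x.
X = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = np.exp(0.2 * X.ravel())

inner = LinearRegression()
model = TransformedTargetRegressor(
    regressor=inner, func=np.log, inverse_func=np.exp
)
model.fit(X, y)

# The constructor argument is stored untouched; fit() trains a clone.
print(model.regressor is inner)             # same object kept as a parameter
print(hasattr(inner, "coef_"))              # the original was never fitted
print(hasattr(model.regressor_, "coef_"))   # the fitted clone lives in regressor_
```

Learned state ends up in attributes with a trailing underscore (regressor_), which is the same convention the threshold_ attribute in the question follows.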

Your current implementation seems about right. As far as this concern goes:

How can I make a threshold parameter which can be tuned in one call of .fit and be fixed for subsequent .fit calls?

I would advise against that, because scikit-learn's API is built around the fit method re-fitting all tunable aspects of the model. There are two routes you can take here: either add a keyword argument to fit that explicitly protects the threshold from being updated, or go with what @rotem-tal suggested. If you choose the latter, it might look something like this:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

def optimal_threshold(y_raw: np.ndarray) -> np.ndarray:
    return np.array([0.1, 0.5, 1])  # some implementation here

class Thresholder(BaseEstimator, ClassifierMixin):
    def __init__(self, regressor):
        self.regressor = regressor
        # not a constructor parameter, so get_params() and clone() ignore it
        self.threshold = None

    def fit(self, X, y):
        # you don't need to clone the regressor
        self.regressor.fit(X, y)

        y_raw = self.regressor.predict(X)
        # learn the threshold on the first call only; later calls reuse it
        if self.threshold is None:
            self.threshold = optimal_threshold(y_raw)

        return self

    def predict(self, X):
        y_raw = self.regressor.predict(X)

        # bins must be 1-D, so the threshold array is passed directly
        y = np.digitize(y_raw, self.threshold)

        return y
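
The other route mentioned above, a keyword argument on fit that protects the threshold, might be sketched as follows. The refit_threshold flag, the class name GuardedThresholder, and the midpoint rule inside optimal_threshold are all hypothetical and chosen only for illustration. This variant also fits a clone stored in regressor_, as TransformedTargetRegressor does, and assumes binary labels 0 and 1:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LinearRegression


def optimal_threshold(y_raw, y):
    # Hypothetical rule: midpoint between the mean raw scores of the two classes.
    return (y_raw[y == 0].mean() + y_raw[y == 1].mean()) / 2.0


# Mixins go to the left of BaseEstimator, per scikit-learn's developer guide.
class GuardedThresholder(ClassifierMixin, BaseEstimator):
    def __init__(self, regressor):
        self.regressor = regressor  # stored untouched

    def fit(self, X, y, refit_threshold=True):
        y = np.asarray(y)
        # Fit a clone so the constructor argument is never modified.
        self.regressor_ = clone(self.regressor).fit(X, y)
        self.classes_ = np.unique(y)
        # Relearn the threshold unless the caller asks to keep the existing one.
        if refit_threshold or not hasattr(self, "threshold_"):
            y_raw = self.regressor_.predict(X)
            self.threshold_ = optimal_threshold(y_raw, y)
        return self

    def predict(self, X):
        y_raw = self.regressor_.predict(X)
        # digitize returns 0 below the threshold and 1 at or above it.
        return np.digitize(y_raw, [self.threshold_])


rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 1))
y1 = (X1.ravel() > 0).astype(int)
X2 = rng.normal(loc=1.0, size=(100, 1))
y2 = (X2.ravel() > 1).astype(int)

clf = GuardedThresholder(LinearRegression()).fit(X1, y1)
first = clf.threshold_
clf.fit(X2, y2, refit_threshold=False)  # regressor is refit, threshold is kept
print(first == clf.threshold_)
```

Because the flag defaults to True, a plain fit(X, y) still re-fits everything, so tools that call fit without extra arguments keep the usual scikit-learn behaviour; only an explicit refit_threshold=False freezes the threshold.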
