How to implement a meta-estimator with the scikit-learn API?
Problem description
I would like to implement a simple wrapper / meta-estimator which is compatible with all of scikit-learn. It is hard to find a full description of what exactly I need.
The goal is to have a regressor which also learns a threshold to become a classifier. So I came up with:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class Thresholder(BaseEstimator, ClassifierMixin):
    def __init__(self, regressor):
        self.regressor = regressor
        # threshold_ does not get initialized in __init__ ??

    def fit(self, X, y, optimal_threshold):
        self.regressor = clone(self.regressor)  # is this required by sklearn??
        self.regressor.fit(X, y)
        # predict on the training data to get the raw scores to threshold
        y_raw = self.regressor.predict(X)
        self.threshold_ = optimal_threshold(y_raw)
        return self

    def predict(self, X):
        y_raw = self.regressor.predict(X)
        y = np.digitize(y_raw, [self.threshold_])
        return y
Does this implement the full API I need?
My main question is where to put the threshold. I want it to be learned only once and then re-used in subsequent .fit calls with new data without being readjusted. With the current version it is retuned on every .fit call, which I do not want.
On the other hand, if I make it a fixed parameter self.threshold and pass it to __init__, then I'm not supposed to change it with the data?
How can I make a threshold parameter that can be tuned in one call of .fit and stay fixed for subsequent .fit calls?
Recommended answer
I actually wrote a blog post about this the other day. I assume you are trying to build something similar to TransformedTargetRegressor, so I would suggest taking a look at its source code as a starting point.
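For readers who have not used it, here is a minimal sketch of how TransformedTargetRegressor wraps another estimator. The data and the choice of LinearRegression with a log transform are illustrative assumptions, not part of the original answer. The point to notice is the pattern a meta-estimator can imitate: the constructor argument regressor is left untouched, and the fitted copy is stored on a trailing-underscore attribute.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive targets (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.exp(X @ np.array([0.5, -0.2]) + 0.1)

# The wrapped regressor is fitted on log(y); predictions are mapped back with exp
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X, y)

unfitted_inner = model.regressor   # the object passed to __init__, not fitted
fitted_inner = model.regressor_    # a fitted clone created inside fit
predictions = model.predict(X)
```

Keeping the unfitted constructor argument separate from the fitted clone is what lets utilities such as clone and get_params treat the wrapper like any other estimator.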
Your current implementation seems about right. As far as this concern goes:
"How can I make a threshold parameter which can be tuned in one call of .fit and be fixed for subsequent .fit calls?"
I would advise against that, because scikit-learn's API is built around the fit method re-fitting all tunable aspects of the model. There are two routes you can take here: either add a **kwarg to fit that explicitly protects the threshold from being updated, or go with what @rotem-tal suggested. If you choose the latter, it might look something like this:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

def optimal_threshold(y_raw: np.ndarray) -> np.ndarray:
    return np.array([0.1, 0.5, 1])  # some implementation here

class Thresholder(BaseEstimator, ClassifierMixin):
    def __init__(self, regressor):
        self.regressor = regressor
        self.threshold = None

    # the module-level function above is used unless another one is passed in
    def fit(self, X, y, optimal_threshold=optimal_threshold):
        # you don't need to clone the regressor
        self.regressor.fit(X, y)
        y_raw = self.regressor.predict(X)
        # learn the threshold only on the first call to fit
        if self.threshold is None:
            self.threshold = optimal_threshold(y_raw)
        return self

    def predict(self, X):
        y_raw = self.regressor.predict(X)
        # np.digitize needs 1-D bin edges, so a scalar threshold is wrapped
        y = np.digitize(y_raw, np.atleast_1d(self.threshold))
        return y
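The first route mentioned above, a keyword argument on fit that protects the threshold, could be sketched as follows. The names KwargThresholder, refit_threshold, threshold_func and median_threshold are hypothetical and chosen only for illustration, and the median rule is a placeholder for a real threshold search. This sketch also clones the regressor into regressor_, following the convention TransformedTargetRegressor uses; that is a choice made here, not something the answer requires.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LinearRegression


def median_threshold(y_raw):
    # Placeholder rule (an assumption): split the raw scores at their median
    return float(np.median(y_raw))


class KwargThresholder(ClassifierMixin, BaseEstimator):
    def __init__(self, regressor, threshold_func=median_threshold):
        # Only store constructor arguments here; nothing is learned yet
        self.regressor = regressor
        self.threshold_func = threshold_func

    def fit(self, X, y, refit_threshold=True):
        # Always refit the wrapped regressor on the new data
        self.regressor_ = clone(self.regressor).fit(X, y)
        # Learn the threshold on the first fit, or whenever the caller allows it
        if refit_threshold or not hasattr(self, "threshold_"):
            self.threshold_ = self.threshold_func(self.regressor_.predict(X))
        return self

    def predict(self, X):
        # Scores below the threshold map to 0, all others to 1
        return np.digitize(self.regressor_.predict(X), [self.threshold_])


rng = np.random.default_rng(0)
X1 = rng.normal(size=(40, 1))
X2 = rng.normal(size=(40, 1)) + 5.0
y1, y2 = 2.0 * X1[:, 0], 2.0 * X2[:, 0]

clf = KwargThresholder(LinearRegression())
clf.fit(X1, y1)                          # learns threshold_ from the first batch
first = clf.threshold_
clf.fit(X2, y2, refit_threshold=False)   # regressor refits, threshold_ is kept
kept = clf.threshold_
labels = clf.predict(X2)
```

Note that the extra keyword only takes effect when fit is called directly; tools such as Pipeline or GridSearchCV call fit(X, y) and would therefore re-learn the threshold by default.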