Use sklearn GridSearchCV on custom class whose fit method takes 3 arguments
Problem description
I am working on a project that involves implementing some algorithms as Python classes and testing their performance. I decided to write them up as sklearn estimators so that I could use GridSearchCV for validation.
However, one of my algorithms, for Inductive Matrix Completion (IMC), takes more than just X and y as arguments. This becomes a problem for GridSearchCV.fit, as there appears to be no way to pass more than just X and y to the fit method of the estimator. The source shows the following signature for GridSearchCV.fit:
def fit(self, X, y=None, groups=None, **fit_params):
And of course the downstream methods expect only these two arguments. Obviously it would be neither trivial nor advisable to modify my local copy of GridSearchCV to accommodate my needs.
For reference IMC basically states that $ R \approx XW^THY^T $. So my fit method takes the following form:
def fit(self, R, X, Y):
So trying the following fails, as the Y value never gets passed to the IMC.fit method:
imc = IMC()
params = {...}
gs = GridSearchCV(imc, param_grid=params)
gs.fit(R, X, Y)
I've created a workaround for this by modifying the IMC.fit method like so (the same logic also has to be inserted into the score method):
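def fit(self, R, X, Y=None):
    if Y is None:
        # X is the stacked array: find the first column made entirely of 999
        split = np.where(np.all(X == 999, axis=0))[0][0]
        # everything right of the marker column is Y, everything left is X
        Y = X[:, split + 1:]
        X = X[:, :split]
    ...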
This allows me to use numpy.hstack to stack X and Y horizontally and insert a column of all 999 between them. This array can then be passed to GridSearchCV.fit as follows:
data = np.hstack([X, np.ones((X.shape[0],1)) * 999, Y])
gs.fit(R, data)
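The packing and unpacking steps above can be sketched as a round trip. This is only a minimal illustration: the helper names pack and unpack are hypothetical (they are not part of the original IMC class), and it assumes that X and Y already have the same number of rows and that 999 never fills an entire real feature column.

```python
import numpy as np

SENTINEL = 999  # assumption: no real feature column consists entirely of this value


def pack(X, Y):
    """Stack X and Y side by side with one all-sentinel column between them."""
    divider = np.full((X.shape[0], 1), SENTINEL)
    return np.hstack([X, divider, Y])


def unpack(data):
    """Split a packed array back into X and Y at the first all-sentinel column."""
    split = np.where(np.all(data == SENTINEL, axis=0))[0][0]
    return data[:, :split], data[:, split + 1:]


# Toy data: 3 rows, 2 features in X and 3 features in Y
X = np.arange(6, dtype=float).reshape(3, 2)
Y = np.arange(9, dtype=float).reshape(3, 3)

data = pack(X, Y)             # shape (3, 2 + 1 + 3)
X_back, Y_back = unpack(data)
```

If a genuine feature column ever equals 999 in every row, the split lands in the wrong place, which is one reason the approach feels fragile.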
This approach works, but feels pretty hacky. Therefore my question is this:
Is there a generally accepted way or best practice for passing more than 2 arguments to a fit method using GridSearchCV?
So after getting some inspiration from a friend on this (@Matthew Drury) I constructed a much more elegant solution.
Again the problem is framed as such:
I have a matrix completion method that takes X, Y, and R as arguments and attempts to construct W and H that minimize the residual $ R - XW^THY^T $ over all observed indices in R. A basic implementation of a fit method would look like this:
def fit(X, Y, R):
    W, H = do_minimization(X, Y, R)
    return W, H
This doesn't fit well into the standard sklearn model, where fit takes an X (the features that feed into a model) and y (the results) and looks like this:
def fit(X, y):
    W, H = do_minimization(X, y)
    return W, H
This isn't really an issue until you start using GridSearchCV or other cross validation methods, as they expect the data to fit the latter format. So to marry these two concepts I needed a way of packaging two disparate matrices X and Y into a single structure without losing the separate nature of the two.
In the 5 minutes I had to dedicate to this originally, I came up with the hacky solution. In a matrix R of shape (n, m), where the rows correspond to the records in X and the columns correspond to the records in Y, there are b entries in total. If we take the row and column indices of all of these entries and index X on the rows and Y on the columns, we end up with matrices for X and Y that have the same number of rows. These can then be stacked horizontally, separated by a column of nonsense, and passed to the cross validation methods without issue (we just need a couple of helper methods inside the original class to reconstruct the original X and Y from the stack before fitting).
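The indexing step described above can be sketched with a toy example. The answer does not say how unobserved entries of R are encoded, so the use of NaN as the marker here is an assumption, as are the names X_long, Y_long and r_long.

```python
import numpy as np

# Toy R of shape (n, m) = (2, 3); NaN marks an unobserved entry (assumption)
R = np.array([[5.0, np.nan, 1.0],
              [np.nan, 2.0, np.nan]])
X = np.array([[0.1, 0.2],
              [0.3, 0.4]])            # one feature row per row of R
Y = np.array([[1.0], [2.0], [3.0]])   # one feature row per column of R

# Row and column indices of the b observed entries
rows, cols = np.nonzero(~np.isnan(R))

X_long = X[rows]        # shape (b, 2): row features repeated per observed entry
Y_long = Y[cols]        # shape (b, 1): column features repeated per observed entry
r_long = R[rows, cols]  # shape (b,): the observed values themselves
```

After this step X_long and Y_long are row-aligned, so any splitter that selects row indices selects matching rows of both.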
The point of this question was to find the elegant solution, or preferably an existing one. There doesn't seem to be an existing one, so I will propose the following model for any future estimators/classifiers that inherit from sklearn and require more than a single feature matrix for the fit method.
Create a DataHandler
When using GridSearchCV, the fit method does a round of checks before firing off any calls to the estimator's fit method. One of these determines whether the passed X array is indexable. This test basically checks that X implements __getitem__ or iloc and is the same length as y. The length check requires X to have a shape attribute. At that point the split indices and fits can be computed as expected. So we need a wrapper that implements __getitem__ and has a shape attribute.
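class DataHandler(object):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        # exposed so that length checks on the wrapper succeed
        self.shape = self.X.shape

    def __getitem__(self, x):
        # index both arrays with the same row selection
        return self.X[x], self.Y[x]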
That's it! We can now modify the fit method to match the sklearn style, but in this case, instead of X being an array, it will be either a tuple (the result returned by the __getitem__ method) or an instance of our DataHandler class.
Now GridSearchCV will work as expected by just passing an instance of a DataHandler containing the X and Y arrays.
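A minimal sketch of how a fit method might accept either form is shown here. The helper split_inputs is hypothetical, and train_idx merely stands in for the row indices a cross validation splitter would produce. How a given scikit-learn version actually indexes the wrapper internally is not verified by this sketch.

```python
import numpy as np


class DataHandler:
    """Bundle two row-aligned arrays so they can be indexed together."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.shape = self.X.shape

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]


def split_inputs(data):
    """Hypothetical helper: accept a DataHandler or an (X, Y) tuple."""
    if isinstance(data, DataHandler):
        return data.X, data.Y
    X, Y = data
    return X, Y


handler = DataHandler(np.arange(8).reshape(4, 2), np.arange(12).reshape(4, 3))

train_idx = np.array([0, 2])                    # stand-in for splitter output
X_train, Y_train = split_inputs(handler[train_idx])  # tuple case
X_all, Y_all = split_inputs(handler)                 # DataHandler case
```

Calling a helper like this at the top of both fit and score keeps the unpacking logic in one place.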