Use sklearn GridSearchCV on a custom class whose fit method takes 3 arguments

Problem description

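I am working on a project that involves implementing some algorithms as Python classes and testing their performance. I decided to write them as sklearn estimators so that I could use GridSearchCV for validation.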

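However, one of my algorithms, for Inductive Matrix Completion (IMC), takes more than just X and y as arguments. This is a problem for GridSearchCV.fit, because there appears to be no way to pass anything beyond X and y to the estimator's fit method. The source shows the following signature for GridSearchCV.fit: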

def fit(self, X, y=None, groups=None, **fit_params):

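And of course the downstream methods expect only these two arguments. Modifying my local copy of GridSearchCV to accommodate my needs would be neither trivial nor advisable.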

For reference, IMC basically states that $ R \approx XW^THY^T $. So my fit method takes the following form:

def fit(self, R, X, Y):
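To make the shapes in that factorization concrete, here is a minimal NumPy sketch. The question does not state the dimensions of W and H, so the sizes below (d1, d2, and the rank k) are assumptions chosen only so that the product is well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 6, 5      # rows and columns of R
d1, d2 = 4, 3    # feature counts for X and Y (assumed for illustration)
k = 2            # rank of the factorization (assumed for illustration)

X = rng.normal(size=(n, d1))   # one feature row per row of R
Y = rng.normal(size=(m, d2))   # one feature row per column of R
W = rng.normal(size=(k, d1))
H = rng.normal(size=(k, d2))

# (n, d1) @ (d1, k) @ (k, d2) @ (d2, m) -> (n, m)
R_hat = X @ W.T @ H @ Y.T
print(R_hat.shape)  # (6, 5)
```

Under these assumptions the reconstruction has the same shape as R, which is why fit needs both feature matrices in addition to R.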

So the following fails, because the Y value never gets passed to the IMC.fit method:

imc = IMC()
params = {...}
gs = GridSearchCV(imc, param_grid=params)
gs.fit(R, X, Y)

I worked around this by modifying the IMC.fit method as follows (the same logic also has to be inserted into the score method):

def fit(self, R, X, Y=None):
    if Y is None:
        split = np.where(np.all(X == 999, axis=0))[0][0]
        Y = X[:, split + 1:]
        X = X[:, :split]
    ...

This lets me use numpy.hstack to stack X and Y horizontally with a column of all 999s between them. The resulting array can then be passed to GridSearchCV.fit as follows:

data = np.hstack([X, np.ones((X.shape[0],1)) * 999, Y])
gs.fit(R, data)
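The round trip behind this workaround can be checked in isolation. The sketch below is only an illustration: pack and unpack are hypothetical helper names, not part of the IMC class or of sklearn.

```python
import numpy as np

SENTINEL = 999  # same marker value as in the workaround above


def pack(X, Y):
    # Stack X and Y side by side with one all-sentinel column between them.
    divider = np.full((X.shape[0], 1), SENTINEL)
    return np.hstack([X, divider, Y])


def unpack(data):
    # The first column that is entirely SENTINEL marks the boundary.
    split = np.where(np.all(data == SENTINEL, axis=0))[0][0]
    return data[:, :split], data[:, split + 1:]


X = np.arange(6, dtype=float).reshape(3, 2)
Y = np.arange(9, dtype=float).reshape(3, 3)

X_back, Y_back = unpack(pack(X, Y))
print(np.array_equal(X_back, X), np.array_equal(Y_back, Y))  # True True
```

Two limits are worth noting: X and Y must have the same number of rows for hstack to work, and the split lands in the wrong place if a real column of X happens to consist entirely of 999.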

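This approach works, but it feels pretty hacky. So my question is this:

Is there a generally accepted way, or a best practice, for passing more than two arguments to a fit method when using GridSearchCV?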

Solution

So after getting some inspiration from a friend on this (@Matthew Drury) I constructed a much more elegant solution.

Again the problem is framed as such:

I have a matrix completion method that takes X, Y, and R as arguments and attempts to construct W and H that minimize the residual $ R - XW^THY^T $ over all observed indices in R. A basic implementation of a fit method would look like this:

def fit(X, Y, R):
    W, H = do_minimization(X, Y, R)
    return W, H

This doesn't fit well into the standard sklearn model where fit takes an X (the features that feed into a model) and y (the results) and looks like this:

def fit(X, y):
    W, H = do_minimization(X, y)
    return W, H

This isn't really an issue until you start using GridSearchCV or other cross validation methods as they expect the data to fit the latter format. So to marry these two concepts I needed a way of packaging two disparate matrices X and Y into a single structure without losing the separate nature of the two.

In the 5 minutes I originally had to dedicate to this, I came up with the hacky solution. In a matrix R of shape (n, m), where the rows correspond to the records in X and the columns correspond to the records in Y, there are b observed entries in total. If we take the row and column indices of all of these entries and index X on the rows and Y on the columns, we end up with matrices for X and Y of equal length. These can then be stacked horizontally, separated by a column of nonsense, and passed to the cross-validation methods without issue (we just need a couple of helper methods inside the original class to reconstruct the original X and Y from the stack before fitting).
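The index expansion described above can be sketched in a few lines of NumPy. How unobserved entries of R are represented is not stated in the post, so marking them with NaN is an assumption made here for illustration.

```python
import numpy as np

# Assumption: unobserved entries of R are marked with NaN.
R = np.array([[5.0, np.nan, 1.0],
              [np.nan, 3.0, np.nan]])
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])       # one row per row of R
Y = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])       # one row per column of R

# Row and column indices of the b observed entries.
rows, cols = np.nonzero(~np.isnan(R))

X_obs = X[rows]        # shape (b, 2): features of each entry's row
Y_obs = Y[cols]        # shape (b, 2): features of each entry's column
r_obs = R[rows, cols]  # shape (b,): the observed values

print(X_obs.shape, Y_obs.shape, r_obs)
```

After this step X_obs, Y_obs, and r_obs all have one row per observed entry, so a cross-validation splitter can slice all three with the same indices.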

The point of this question was to find an elegant solution, or preferably an existing one. No existing solution seems to be available, so I propose the following pattern for any future estimators/classifiers that inherit from sklearn and require more than a single feature matrix in their fit method.

Create a DataHandler

When using GridSearchCV, the fit method does a round of checks before firing off any calls to the estimator's fit method. One of these determines whether the passed X array is indexable. This test basically checks that X implements __getitem__ or iloc and is the same length as y. The length check requires X to have a shape attribute. Once these checks pass, the split indices and fits can be computed as expected. So we need a wrapper that implements __getitem__ and has a shape attribute.

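class DataHandler(object):

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        # The length check reads shape[0], so mirror X's shape
        self.shape = self.X.shape

    def __getitem__(self, x):
        # Index both matrices with the same rows; returns a tuple
        return self.X[x], self.Y[x]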

That's it! We can now modify the fit method to match the sklearn style, but in this case, instead of X being an array, it will be either a tuple (the result returned by the __getitem__ method) or an instance of our DataHandler class.

Now GridSearchCV will work as expected by just passing an instance of a DataHandler containing the X and Y arrays.
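As a rough end-to-end sketch of this pattern, the example below runs GridSearchCV on a DataHandler. BilinearRidge is a hypothetical toy estimator written for this illustration (it is not IMC and not part of sklearn), and the data are synthetic. The sketch assumes X and Y have already been expanded to one row per observed entry, as described earlier.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV


class DataHandler:
    """Bundle two row-aligned matrices so they are indexed together."""

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
        self.shape = self.X.shape  # lets length checks read shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.Y[idx]


class BilinearRidge(RegressorMixin, BaseEstimator):
    """Toy stand-in (not IMC): models r_i ~ x_i^T M y_i with a ridge penalty."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    @staticmethod
    def _unpack(data):
        # During CV `data` is the tuple from __getitem__;
        # on the final refit it is the DataHandler itself.
        if isinstance(data, DataHandler):
            return data.X, data.Y
        return data

    @staticmethod
    def _features(X, Y):
        # Row i holds the flattened outer product of x_i and y_i.
        return np.einsum("ij,ik->ijk", X, Y).reshape(X.shape[0], -1)

    def fit(self, data, r):
        X, Y = self._unpack(data)
        F = self._features(X, Y)
        A = F.T @ F + self.alpha * np.eye(F.shape[1])
        self.coef_ = np.linalg.solve(A, F.T @ r)
        return self

    def predict(self, data):
        X, Y = self._unpack(data)
        return self._features(X, Y) @ self.coef_


# Synthetic data: one row of X, Y, and r per observed entry.
rng = np.random.default_rng(0)
n_obs = 60
X = rng.normal(size=(n_obs, 3))
Y = rng.normal(size=(n_obs, 2))
M = rng.normal(size=(3, 2))
r = np.einsum("ij,jk,ik->i", X, M, Y) + 0.1 * rng.normal(size=n_obs)

gs = GridSearchCV(BilinearRidge(), param_grid={"alpha": [0.01, 1.0, 100.0]}, cv=3)
gs.fit(DataHandler(X, Y), r)
pred = gs.predict(DataHandler(X, Y))
print(gs.best_params_, pred.shape)
```

The score method comes from RegressorMixin, which calls predict and computes R^2, so predict has to accept the same tuple-or-DataHandler input as fit. Exact internals of the indexing checks vary between sklearn versions, so treat this as a sketch to adapt, not a guaranteed recipe.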
