How do I SelectKBest using mutual information from a mixture of discrete and continuous features?


Question


I am using scikit-learn to train a classification model. My training data has both discrete and continuous features. I want to do feature selection using maximum mutual information. If I have vectors x and labels y, and the first three feature values are discrete, I can get the MMI values like so:

mutual_info_classif(x, y, discrete_features=[0, 1, 2])
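For concreteness, a minimal self-contained sketch of that call (the data here is synthetic, since the original vectors are not shown):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
# Three integer-coded discrete features followed by two continuous ones
x = np.hstack([rng.randint(0, 3, size=(100, 3)), rng.randn(100, 2)])
y = rng.randint(0, 2, size=100)

# Indices 0-2 are flagged as discrete; the rest are treated as continuous
mi = mutual_info_classif(x, y, discrete_features=[0, 1, 2], random_state=0)
print(mi.shape)  # one non-negative score per feature
```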


Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this:

SelectKBest(score_func=mutual_info_classif).fit(x, y)


but there's no way to pass the discrete features mask to SelectKBest. Is there some syntax to do this that I'm overlooking, or do I have to write my own score function wrapper?

Answer


Unfortunately I could not find this functionality in SelectKBest. But what we can easily do is extend SelectKBest as a custom class, overriding the fit() method that will be called.


This is the current fit() method of SelectKBest (taken from the source on GitHub):

# No provision for extra parameters here
def fit(self, X, y):
    X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)

    ....
    ....

    # Here only the X, y are passed to scoring function
    score_func_ret = self.score_func(X, y)

    ....        
    ....

    self.scores_ = np.asarray(self.scores_)

    return self


Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (commented in the code):

import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.utils import check_X_y

class SelectKBestCustom(SelectKBest):

    # Changed here
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)

        if not callable(self.score_func):
            raise TypeError("The score function should be a callable, %s (%s) "
                        "was passed."
                        % (self.score_func, type(self.score_func)))

        self._check_params(X, y)

        # Changed here also
        score_func_ret = self.score_func(X, y, discrete_features=discrete_features)
        if isinstance(score_func_ret, (list, tuple)):
            self.scores_, self.pvalues_ = score_func_ret
            self.pvalues_ = np.asarray(self.pvalues_)
        else:
            self.scores_ = score_func_ret
            self.pvalues_ = None

        self.scores_ = np.asarray(self.scores_)
        return self

This can be called simply as:

clf = SelectKBestCustom(mutual_info_classif, k=2)
clf.fit(X, y, discrete_features=[0, 1, 2])


Edit: The above solution can also be useful in pipelines, and the discrete_features parameter can be assigned a different value each time fit() is called.
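A sketch of that pipeline usage follows. The step name select, the synthetic data, and the LogisticRegression estimator are illustrative assumptions; the class body is a condensed version of the custom fit() defined above, and the fit parameter is routed with scikit-learn's `<step name>__<param name>` syntax:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils import check_X_y

class SelectKBestCustom(SelectKBest):
    # Condensed custom fit() that forwards discrete_features to the scorer
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
        self._check_params(X, y)
        self.scores_ = np.asarray(
            self.score_func(X, y, discrete_features=discrete_features))
        self.pvalues_ = None
        return self

pipe = Pipeline([
    ('select', SelectKBestCustom(mutual_info_classif, k=2)),
    ('clf', LogisticRegression()),
])

rng = np.random.RandomState(0)
X = np.hstack([rng.randint(0, 3, size=(80, 3)), rng.randn(80, 2)])
y = rng.randint(0, 2, size=80)

# Pipeline routes fit params to the named step via 'select__discrete_features'
pipe.fit(X, y, select__discrete_features=[0, 1, 2])
print(pipe.named_steps['select'].get_support())
```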


Another solution (less preferable): if you just need SelectKBest to work with mutual_info_classif temporarily (only to analyse the results), we can also write a custom function that calls mutual_info_classif internally with a hard-coded discrete_features. Something along the lines of:

def mutual_info_classif_custom(X, y):
    # To change discrete_features you have to redefine this function each time,
    # because once the function is supplied to SelectKBest it cannot be changed
    discrete_features = [0, 1, 2]
    return mutual_info_classif(X, y, discrete_features=discrete_features)

Usage of the above function:

selector = SelectKBest(mutual_info_classif_custom).fit(X, y)
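As a lighter variant of the same workaround (an addition, not part of the original answer), functools.partial can bind discrete_features once instead of redefining a wrapper for each mask; the data here is synthetic and purely illustrative:

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.RandomState(0)
X = np.hstack([rng.randint(0, 3, size=(80, 3)), rng.randn(80, 2)])
y = rng.randint(0, 2, size=80)

# partial pre-binds the mask, so SelectKBest still sees an (X, y) callable
score_func = partial(mutual_info_classif, discrete_features=[0, 1, 2])
selector = SelectKBest(score_func, k=2).fit(X, y)
X_selected = selector.transform(X)
print(X_selected.shape)  # two columns kept
```

To select a different mask, build a new partial and a new selector; unlike the hard-coded wrapper, no function body needs to be edited.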

