Avoid scaling binary columns in scikit-learn StandardScaler


Question


I'm building a linear regression model in scikit-learn and scaling the inputs as a preprocessing step in a scikit-learn Pipeline. Is there any way to avoid scaling the binary columns? These columns are currently scaled along with every other column, so their values end up centered around 0 instead of being 0 or 1. I get values like [-0.6, 0.3], which means an input value of 0 still influences the predictions of my linear model.

Basic code to illustrate:

>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import Ridge
>>> X = np.hstack( (np.random.random((1000, 2)),
                np.random.randint(2, size=(1000, 2))) )
>>> X
array([[ 0.30314072,  0.22981496,  1.        ,  1.        ],
       [ 0.08373292,  0.66170678,  1.        ,  0.        ],
       [ 0.76279599,  0.36658793,  1.        ,  0.        ],
       ...,
       [ 0.81517519,  0.40227095,  0.        ,  0.        ],
       [ 0.21244587,  0.34141014,  0.        ,  0.        ],
       [ 0.2328417 ,  0.14119217,  0.        ,  0.        ]])
>>> scaler = StandardScaler()
>>> scaler.fit_transform(X)
array([[-0.67768374, -0.95108883,  1.00803226,  1.03667198],
       [-1.43378124,  0.53576375,  1.00803226, -0.96462528],
       [ 0.90632643, -0.48022732,  1.00803226, -0.96462528],
       ...,
       [ 1.08682952, -0.35738315, -0.99203175, -0.96462528],
       [-0.99022572, -0.56690563, -0.99203175, -0.96462528],
       [-0.91994001, -1.25618613, -0.99203175, -0.96462528]])

I'd love for the output of the last line to be:

>>> scaler.fit_transform(X, dont_scale_binary_or_something=True)
array([[-0.67768374, -0.95108883,  1.        ,  1.        ],
       [-1.43378124,  0.53576375,  1.        ,  0.        ],
       [ 0.90632643, -0.48022732,  1.        ,  0.        ],
       ...,
       [ 1.08682952, -0.35738315,  0.        ,  0.        ],
       [-0.99022572, -0.56690563,  0.        ,  0.        ],
       [-0.91994001, -1.25618613,  0.        ,  0.        ]])

Is there any way I can accomplish this? I suppose I could select the columns that aren't binary, transform only those, and then write the transformed values back into the array, but I'd like it to play nicely with the scikit-learn Pipeline workflow, so that I can do something like:

clf = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
clf.set_params(scaler__dont_scale_binary_features=True, ridge__alpha=0.04).fit(X, y)
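For reference, the manual workaround mentioned above (scale only the non-binary columns and write them back) might look roughly like the sketch below. The helper name scale_non_binary and the rule "a column is binary if every entry is 0 or 1" are assumptions made for illustration, not part of scikit-learn. Because it fits and transforms in one call, it does not remember the training statistics, which is exactly why it fits poorly into a Pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_non_binary(X):
    """Standardize only the columns that contain values other than 0 and 1.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: a column is "binary" if every entry is exactly 0 or 1.
    is_binary = np.all((X == 0) | (X == 1), axis=0)
    X_out = X.copy()
    if (~is_binary).any():
        # Fit and apply the scaler on the non-binary columns only.
        X_out[:, ~is_binary] = StandardScaler().fit_transform(X[:, ~is_binary])
    return X_out, is_binary


# Same layout as the question: two continuous columns, then two binary ones.
rng = np.random.RandomState(0)
X = np.hstack((rng.random_sample((1000, 2)), rng.randint(2, size=(1000, 2))))
X_scaled, is_binary = scale_non_binary(X)
```

The binary columns come back untouched and in their original positions, while the continuous columns have zero mean and unit variance.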

Solution

I'm posting code that I adapted from @miindlek's response in case it is helpful to others. I encountered an error when I didn't include BaseEstimator. Thank you again, @miindlek. Below, bin_vars_index is an array of column indexes for the binary variables, and cont_vars_index is the same for the continuous variables that you want to scale.

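from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomScaler(BaseEstimator, TransformerMixin):
    # note: returns the feature matrix with the binary columns ordered first
    def __init__(self, bin_vars_index, cont_vars_index, copy=True, with_mean=True, with_std=True):
        # store every constructor argument unchanged so get_params() and clone() work
        self.bin_vars_index = bin_vars_index
        self.cont_vars_index = cont_vars_index
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # recent scikit-learn releases require keyword arguments for StandardScaler
        self.scaler_ = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
        # learn mean and std from the continuous columns only
        self.scaler_.fit(X[:, self.cont_vars_index])
        return self

    def transform(self, X, copy=None):
        # scale the continuous columns and leave the binary columns untouched
        X_tail = self.scaler_.transform(X[:, self.cont_vars_index], copy=copy)
        return np.concatenate((X[:, self.bin_vars_index], X_tail), axis=1)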
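As an alternative to a hand-written transformer, scikit-learn 0.20 and later ship ColumnTransformer, which can apply StandardScaler to selected columns and pass the rest through unchanged. The sketch below is a different technique from the answer above, not the answerer's code; the synthetic target y and the step names "preprocess" and "scale" are made up for the example. Note that the column order differs from CustomScaler: the scaled columns come first, followed by the passthrough (binary) columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same layout as the question: two continuous columns, then two binary ones.
rng = np.random.RandomState(0)
X = np.hstack((rng.random_sample((1000, 2)), rng.randint(2, size=(1000, 2))))
# Synthetic target, invented purely so the pipeline has something to fit.
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=1000)

cont_vars_index = [0, 1]  # continuous columns to standardize

preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), cont_vars_index)],
    remainder="passthrough",  # binary columns are forwarded unchanged
)

clf = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])
clf.set_params(ridge__alpha=0.04).fit(X, y)

# Inspect what the ridge step actually receives.
X_t = clf.named_steps["preprocess"].transform(X)
```

Parameters of the inner scaler remain reachable through the usual double-underscore syntax, for example clf.set_params(preprocess__scale__with_std=False).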
