Apply multiple StandardScaler's to individual groups?


Question

Is there a pythonic way to chain together sklearn's StandardScaler instances to independently scale data within groups? I.e., if I wanted to independently scale the features of the iris dataset for each class, I could use the following code:

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['class'] = data['target']

means = df.groupby('class').mean()
stds = df.groupby('class').std()

df_rescaled = (
    (df.drop(columns=['class']) - means.reindex(df['class']).values) /
    stds.reindex(df['class']).values
)

Here, I'm subtracting by the mean and dividing by the stdev of each group independently. But it's somewhat awkward to carry these means and stdevs around, and essentially re-implement the behavior of StandardScaler, whenever I have a categorical variable I'd like to control for.
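
For instance, scaling a couple of hypothetical new measurements (the new_samples frame below is made up for illustration) means dragging the stored means and stds along and repeating the arithmetic by hand:

# Hypothetical new measurements; the class labels decide which group's
# statistics apply
new_samples = pd.DataFrame({
    'sepal length (cm)': [5.0, 6.5],
    'sepal width (cm)': [3.4, 3.0],
    'petal length (cm)': [1.5, 5.5],
    'petal width (cm)': [0.2, 1.8],
    'class': [0, 2],
})

new_rescaled = (
    (new_samples.drop(columns=['class']) - means.reindex(new_samples['class']).values) /
    stds.reindex(new_samples['class']).values
)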

Is there a more pythonic / sklearn-friendly way to implement this type of scaling?

Answer

Sure, you can use any sklearn operation and apply it to a groupby object.

First, a little convenience wrapper:

import typing
import pandas as pd

class SklearnWrapper:
    def __init__(self, transform: typing.Callable):
        self.transform = transform

    def __call__(self, df):
        # Fit on this group's values and return a like-indexed DataFrame
        transformed = self.transform.fit_transform(df.values)
        return pd.DataFrame(transformed, columns=df.columns, index=df.index)

This wrapper will apply whatever sklearn transform you pass into it to each group.

Finally, the usage is simple:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]

df_rescaled = (
    df.groupby("class")
    .apply(SklearnWrapper(StandardScaler()))
    .drop("class", axis="columns")
)
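
The wrapper only relies on fit_transform, so the same pattern works with other sklearn transformers as well; for example, a quick sketch with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

# Same idea with a different transformer; any object exposing fit_transform works
df_minmax = (
    df.groupby("class")
    .apply(SklearnWrapper(MinMaxScaler()))
    .drop("class", axis="columns")
)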

Edit: You can pretty much do anything with SklearnWrapper. Here is an example of transforming each group and then reversing the operation. To support that, the wrapper must not overwrite its transformation object; instead it fits a fresh copy each time a new group is seen and adds it to a list.

I have also replicated a bit of sklearn's interface for easier usage (you can extend it with any method you want by passing the appropriate method name as a string to the internal _call_with_function helper):

from sklearn.base import clone


class SklearnWrapper:
    def __init__(self, transformation: typing.Callable):
        self.transformation = transformation
        self._group_transforms = []
        # Start with -1 and for each group up the pointer by one
        self._pointer = -1

    def _call_with_function(self, df: pd.DataFrame, function: str):
        self._pointer += 1
        # If the pointer runs past the stored transforms, a new .apply pass
        # has started, so wrap around to the first group again
        if self._pointer >= len(self._group_transforms):
            self._pointer = 0
        return pd.DataFrame(
            getattr(self._group_transforms[self._pointer], function)(df.values),
            columns=df.columns,
            index=df.index,
        )

    def fit(self, df):
        # Clone before fitting so each group keeps its own fitted copy
        # instead of refitting (and overwriting) a single shared object
        self._group_transforms.append(clone(self.transformation).fit(df.values))
        return self

    def transform(self, df):
        return self._call_with_function(df, "transform")

    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)

    def inverse_transform(self, df):
        return self._call_with_function(df, "inverse_transform")

Usage (group transform, inverse the operation, and apply the transform again):

data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]

# Create scaler outside the class
scaler = SklearnWrapper(StandardScaler())

# Scale only the feature columns; keeping "class" out of the scaler leaves the
# original labels available for regrouping later
features = data["feature_names"]

# Fit and transform data (one fitted scaler is stored per group)
df_rescaled = df.groupby("class", group_keys=False)[features].apply(scaler.fit_transform)
df_rescaled["class"] = df["class"]

# Inverse the operation
df_inverted = df_rescaled.groupby("class", group_keys=False)[features].apply(scaler.inverse_transform)
df_inverted["class"] = df["class"]

# Apply the transformation once again
df_transformed = df_inverted.groupby("class", group_keys=False)[features].apply(scaler.transform)
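
If you want to convince yourself the round trip works, a quick check along these lines (reusing the variables from the snippet above) should pass: the inverse transform recovers the original feature values, and re-applying the transform reproduces the first scaling.

import numpy as np

# The inverse transform should give back the original features, and
# transforming again should match the first scaled result
assert np.allclose(df_inverted.sort_index()[features], df.sort_index()[features])
assert np.allclose(df_transformed.sort_index()[features], df_rescaled.sort_index()[features])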
