Can You Consistently Keep Track of Column Labels Using Sklearn's Transformer API?


Problem Description


This seems like a very important issue for this library, and so far I haven't seen a decisive answer, although for the most part the answer appears to be 'No.'

Right now, any method that uses the transformer API in sklearn returns a numpy array as its result. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, there's no clean way to track how the outputs relate to the original column labels, which makes it difficult to use this part of the library to its fullest.
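To make the label loss concrete, here's a minimal sketch (toy column names, purely illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'height': [1.6, 1.8], 'weight': [60.0, 80.0]})
scaled = StandardScaler().fit_transform(df)

print(type(scaled))  # <class 'numpy.ndarray'> -- the column labels are gone
print(scaled)        # [[-1. -1.]
                     #  [ 1.  1.]]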

As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split columns by dtype; 'object' replaces the deprecated np.object alias
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns     = train.select_dtypes(include=object).columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

train_clean = combined_pipe.fit_transform(train)
test_clean  = combined_pipe.transform(test)

In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.

I could easily end up with different arrangements using other modules that share the same API: OrdinalEncoder, SelectKBest, etc.
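For example, a feature selector shrinks the column count and hands back a bare array; its boolean support mask is the only way to map the result back to the inputs. A minimal sketch (hypothetical column names, illustrative only):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(10, 4)           # imagine columns ['a', 'b', 'c', 'd']
y = np.array([0, 1] * 5)

selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.transform(X).shape)  # (10, 2) -- but which two of the four?
print(selector.get_support())       # boolean mask over the input columns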

If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?

There's an extensive discussion about it here, but I don't think anything has been finalized yet.

Solution

Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, the decision was to keep things generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.

Anyhow, we can create wrappers to get the feature names out of a ColumnTransformer. I am not sure whether this captures every possible configuration of ColumnTransformer, but at least it solves your problem.

From the documentation of ColumnTransformer:

Notes

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
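That ordering rule is easy to verify on a toy frame (hypothetical column names, illustrative only):

import pandas as pd
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
ct = ColumnTransformer([('keep_b', 'passthrough', ['b'])],
                       remainder='passthrough')
print(ct.fit_transform(df))  # [[2 1 3]]: 'b' first, then remainder 'a', 'c' on the right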

Try this!

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer, _VectorizerMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection._base import SelectorMixin

train = pd.DataFrame({'age': [23,12, 12, np.nan],
                      'Gender': ['M','F', np.nan, 'F'],
                      'income': ['high','low','low','medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo' : [1,0,0,1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns     = ['Gender','income']

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')

transformed_data = combined_pipe.fit_transform(
    train.drop(columns='y'), train['y'])

def get_feature_out(estimator, feature_in):
    if hasattr(estimator,'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}'
                    for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in
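
# Hypothetical spot-check of get_feature_out (not part of the original
# answer; assumes scikit-learn < 1.2, where get_feature_names still exists):
demo_enc = OneHotEncoder().fit(train[['income']])
print(get_feature_out(demo_enc, ['income']))
# -> array(['income_high', 'income_low', 'income_medium'], dtype=object)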


def get_ct_feature_names(ct):
    # Handles all estimators and pipelines inside a ColumnTransformer.
    # remainder='passthrough' is resolved via the private attribute
    # _feature_names_in, so this needs a DataFrame input (newer sklearn
    # versions expose the same information as feature_names_in_).
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])

    return output_features

pd.DataFrame(transformed_data, 
             columns=get_ct_feature_names(combined_pipe))
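For what it's worth, newer scikit-learn releases make these wrappers largely unnecessary. A minimal sketch, assuming scikit-learn >= 1.1, where ColumnTransformer, Pipeline, and all the steps used here implement get_feature_names_out:

# Assumes scikit-learn >= 1.1 (not the version this answer was written for):
# the fitted ColumnTransformer can report its own output labels.
pd.DataFrame(transformed_data,
             columns=combined_pipe.get_feature_names_out())

From 1.2 onwards there is also combined_pipe.set_output(transform='pandas'), which makes transform return a labelled DataFrame directly, although it rejects transformers with sparse output such as CountVectorizer.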
