Can You Consistently Keep Track of Column Labels Using Sklearn's Transformer API?
Question
This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer API in sklearn returns a numpy array as its result. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
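import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# train and test are assumed to be pandas DataFrames loaded earlier.
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
# np.object was removed in NumPy 1.24; the builtin object is equivalent.
cat_columns = train.select_dtypes(include=object).columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)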
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with different arrangements if I used other modules that share the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Recommended Answer
Yes, you are right that there is no support for tracking feature names in sklearn. The decision was made to keep the API generic, at the level of numpy arrays.
Anyhow, we can create a wrapper to get the feature names from the ColumnTransformer. I am not sure whether it covers every possible kind of transformer inside a ColumnTransformer, but at least it solves your problem.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'y': [0, 1, 1, 1]})

numeric_columns = ['age']
cat_columns = ['Gender', 'income']

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

transformed_data = combined_pipe.fit_transform(train)


def get_transformer_feature_names(columnTransformer):
    """Collect output column names from each fitted pipeline in the ColumnTransformer."""
    output_features = []

    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    # Encoders expand columns, so ask them for the new names.
                    # get_feature_names was removed in scikit-learn 1.2;
                    # get_feature_names_out is its replacement.
                    trans_features.extend(i.get_feature_names_out(features))
                else:
                    # Other steps are assumed to keep the columns unchanged.
                    trans_features = features
            # Only the names produced by the last step of the pipeline are kept.
            output_features.extend(trans_features)

    return output_features


get_transformer_feature_names(combined_pipe)
# ['age', 'Gender_F', 'Gender_M', 'income_high', 'income_low', 'income_medium']

pd.DataFrame(transformed_data,
             columns=get_transformer_feature_names(combined_pipe))