如何从sklearn管道输出Pandas对象 [英] How to output Pandas object from sklearn pipeline

查看:72
本文介绍了如何从sklearn管道输出Pandas对象的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我构建了一个管道,该管道采用了已分为分类和数字列的熊猫数据框.我试图在结果上运行GridSearchCV,并最终查看GridSearchCV选择的最佳性能模型的重要性排名.我遇到的问题是sklearn管道输出numpy数组对象,并在此过程中丢失任何列信息.因此,当我检查模型的最重要系数时,会留下未标记的numpy数组.

I have constructed a pipeline that takes a pandas dataframe that has been split into categorical and numerical columns. I am trying to run GridSearchCV on my results and ultimately look at the ranked features of importance for the best performing model that GridSearchCV selects. The problem I am encountering is that sklearn pipelines output numpy array objects and lose any column information along the way. Thus when I go to examine the most important coefficients of the model I am left with an unlabeled numpy array.

我已经读到,构建自定义转换器可能是解决此问题的一种方法,但我本人没有任何这样做的经验.我还研究了利用sklearn-pandas软件包的方法,但是我很犹豫尝试实现一些可能无法与sklearn并行更新的内容.任何人都可以提出他们认为是解决此问题的最佳途径的建议吗?我也欢迎任何涉及熊猫和sklearn管道应用的文献.

I have read that building a custom transformer might be a possible solution to this, but I do not have any experience doing so myself. I have also looked into leveraging the sklearn-pandas package, but I am hesitant to try and implement something that might not be updated in parallel with sklearn. Can anyone suggest what they believe is the best path to go about getting around this issue? I am also open to any literature that has hands on application of pandas and sklearn pipelines.

我的管道:

# impute and standardize numeric data 
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])

交叉验证:

kf = KFold(n_splits=4, shuffle=True, random_state=44)

cross_val_score(clf, X_train, y_train, cv=kf).mean()

网格搜索:

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)

检查系数:

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_

以下是在海洋"mpg"数据集上执行时当前模型系数的输出:

Here is the output of the model coefficients as it currently stands when performed on the seaborn "mpg" dataset:

array([-4.64782052e-01,  1.47805207e+00, -3.28948689e-01, -5.37033173e+00,
        2.80000700e-01,  2.71523808e+00,  6.29170887e-01,  9.51627968e-01,
       ...
       -1.50574860e+00,  1.88477450e+00,  4.57285471e+00, -6.90459868e-01,
        5.49416409e+00])

理想情况下,我想保留pandas数据帧信息,并在调用OneHotEncoder和其他方法之后检索派生的列名.

Ideally I would like to preserve the pandas dataframe information and retrieve the derived column names after OneHotEncoder and the other methods are called.

推荐答案

我实际上是根据输入创建列名的.如果您的输入已经分为数字类别,则可以使用pd.get_dummies来获取每个类别特征的不同类别的编号.

I would actually go for creating column names from the input. If your input is already divided into numerical an categorical you can use pd.get_dummies to get the number of different category for each categorical feature.

然后,您可以根据问题以及一些人工数据,为该列创建适当的名称,如本工作示例的最后一部分所示.

then you can just create proper names for the columns as shown in the last part of this working example based on the question with some artificial data.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# create aritificial data
numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
numeric_features = ['x1', 'x2']
categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0] })
categorical_features = ['cat1', 'cat2']

X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
X_test = pd.DataFrame({'x1':[2,3], 'x2':[0.2, 0.3], 'cat1':[0, 1], 'cat2':[2, 1]})
y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})

# impute and standardize numeric data 
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])


kf = KFold(n_splits=2, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
print('coefficients : ',  model.named_steps['ridge'].coef_, '\n')

# create column names for categorical hot encoded data
columns_names_to_map = list(np.copy(numeric_features))
columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)

print('columns after preprocessing :', columns_names_to_map,  '\n')
print('#'*80)
print( '\n', 'dataframe of rescaled features with custom colum names: \n\n', pd.DataFrame({col:vals for vals, col in zip (preprocessor.fit_transform(X_train).T, columns_names_to_map)}))
print('#'*80)
print( '\n', 'dataframe of ridge coefficients with custom colum names: \n\n', pd.DataFrame({col:vals for vals, col in zip (model.named_steps['ridge'].coef_.T, columns_names_to_map)}))

上面的代码(最后)打印出以下数据帧,该数据帧是从参数名称到参数值的映射:

the code above (in the end) prints out the following dataframe which is a map from parameter name to parameter value:

这篇关于如何从sklearn管道输出Pandas对象的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆